Answering the Question That Matters Most

Has Student Achievement Increased
Since No Child Left Behind?







Center on Education Policy 

JUNE 2007 







Table of Contents 




Chapter 1: Summary of Key Findings
Main Conclusions
Gains in Reading and Math Since 2002
Narrowing Achievement Gaps
Gains Before and After NCLB
Difficulty of Attributing Causes for Gains
Need for More Transparency in Test Data
State-by-State Achievement Trends on the Web
Future Phases of This Study

Chapter 2: What Makes This Study Unique
Key Findings
Background on the Study
Unique Features of This Study
Limitations of This Study

Chapter 3: Achievement Measures Used in This Study
Key Findings
Complexities of Measuring Achievement
Limitations of Percentages Proficient
How We Addressed the Limitations of the Percentage Proficient
Problems with Tests That We Did Not Address
Rules for Analyzing Data
Individual State Profiles on the Web

Chapter 4: Trends in Overall Achievement
Key Findings
How We Analyzed Overall Test Score Trends
Findings about Overall Test Score Trends Since 2002
Pre- and Post-NCLB Trends
State-by-State Summary of Overall Achievement Trends

Chapter 5: Trends in Achievement Gaps
Key Findings
How We Analyzed Achievement Gap Trends
Gaps in Percentages Proficient Since 2002
Effect Size as a Second Indicator
Tradeoffs between Overall Gains and Gap Reduction
Possible Explanations for Gap Trends
Other Subgroups
State-by-State Summary of Achievement Gap Trends

Chapter 6: Comparing State Test Score Trends with NAEP Results
Key Findings
Background on NAEP
Recent NAEP Trends
How State Test Score Trends Compare with NAEP Trends
Possible Explanations for Differences in State Test and NAEP Results

Chapter 7: Quality and Limitations of State Test Data
Key Findings
Disparity between Ideal and Available Data
Availability of Data
Breaks in Trend Lines
Summary of Available Data and Suggestions for Improvement

Appendix: Study Methods
Collecting and Verifying Phase I Data
Analyzing Phase I Data
Attachment A
Attachment B

Glossary of Technical Terms

References







Chapter 1 

Summary of Key Findings 



Since 2002, the No Child Left Behind Act (NCLB) has spurred far-reaching changes in 
elementary and secondary education, all aimed at accomplishing the same fundamental 
goal — to improve students’ academic achievement. As the Congress prepares to reauthorize 
the Act, two related questions matter most: 

1. Has student achievement in reading and math increased since NCLB was enacted? 

2. Have achievement gaps between different subgroups of students narrowed since NCLB 
was enacted? 

To answer these questions, the Center on Education Policy (CEP), an independent nonprofit
organization, conducted the most comprehensive study of trends in state test scores
since NCLB took effect. We carried out this study with advice from a panel of five nationally
known experts in educational testing or policy research, and with extensive technical
support from the Human Resources Research Organization (HumRRO). Although we
collected data from all 50 states, not every state had enough consistent data to support a
complete analysis of test score trends in reading and math before and after 2002. Based on the
data that states did provide, we reached five main conclusions.



Main Conclusions 

1. In most states with three or more years of comparable test data, student achievement
in reading and math has gone up since 2002, the year NCLB was enacted.

2. There is more evidence of achievement gaps between groups of students narrowing 
since 2002 than of gaps widening. Still, the magnitude of the gaps is often substantial. 

3. In 9 of the 13 states with sufficient data to determine pre- and post-NCLB trends, 
average yearly gains in test scores were greater after NCLB took effect than before. 

4. It is very difficult, if not impossible, to determine the extent to which these 
trends in test results have occurred because of NCLB. Since 2002, states, school 
districts, and schools have simultaneously implemented many different but 
interconnected policies to raise achievement. 

5. Although NCLB emphasizes public reporting of state test data, the data necessary 
to reach definitive conclusions about achievement were sometimes hard to find or 
unavailable, or had holes or discrepancies. More attention should be given to 
issues of the quality and transparency of state test data. 






The study that produced these conclusions had several unique features, designed to address
the limitations of past research on achievement since 2002. We went to great lengths to
gather the most current results on state reading and mathematics tests from all 50 states and
to have all states verify the accuracy of their data. Within each state, we limited our analyses
to test results that were truly comparable from year to year; in other words, results that had not
been affected by such factors as the adoption of new tests or changes in the test score students
must reach to be considered proficient. We also compared trends before and after
2002 to see whether the pace of improvement has sped up or slowed down since NCLB took
effect. We supplemented our analyses of the percentage of students scoring at or above the
proficient level (the “magic number” for NCLB accountability) with analyses of effect
size, a statistical tool based on average (mean) test scores that addresses some of the problems
with the percentage proficient measure. And we analyzed all of the data, which in a
typical state included as many as 16,000 individual numbers, as objectively as possible,
using a consistent set of rules that were developed without regard to whether they would
lead to positive or negative findings.
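
To illustrate why a mean-based measure can tell a different story than the percentage proficient, the following Python sketch may help. It is not the study's computation: the effect size formula used here (change in mean score divided by the average of the two years' standard deviations) is an assumed, commonly used formulation, and the scores, cut score, and function names are hypothetical.

```python
from statistics import mean, pstdev


def effect_size(scores_start, scores_end):
    """Change in mean score, expressed in standard deviation units.

    Assumption: the divisor is the average of the two years' standard
    deviations. The report's own formula is not given in this chapter.
    """
    avg_sd = (pstdev(scores_start) + pstdev(scores_end)) / 2
    return (mean(scores_end) - mean(scores_start)) / avg_sd


def percent_proficient(scores, cut_score):
    """Percentage of students scoring at or above the proficiency cut score."""
    return 100 * sum(s >= cut_score for s in scores) / len(scores)


# Hypothetical scale scores for two years, with a hypothetical cut score of 300.
year1 = [250, 260, 300, 305, 310]
year2 = [280, 290, 295, 298, 340]

print(percent_proficient(year1, 300), percent_proficient(year2, 300))
print(round(effect_size(year1, year2), 2))
```

In this made-up example the percentage proficient falls while the mean score rises, because several students improve without crossing the cut score. That is the kind of divergence the two measures are meant to expose when used together.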

The rest of this chapter summarizes the findings that led us to the five main conclusions. 
Additional key findings can be found at the beginning of the other chapters. 




Gains in Reading and Math Since 2002 

To reach national conclusions about reading and math achievement, we first determined the
test score trends in each state, looking at both the percentages of students scoring proficient
and effect sizes where available. The state trends were then aggregated into a national picture
of achievement that included these and other findings (chapter 4):

• The number of states showing gains in test scores since 2002 is far greater than the number
showing declines. For example, of the 24 states with percentage proficient and effect
size data for middle school reading, 11 demonstrated moderate-to-large gains (average
gains of at least 1 percentage point per year) in middle school reading, and only one
showed a moderate or larger decline.

• Five of the 22 states with both percentage proficient and effect size data at the elementary,
middle, and high school levels made moderate-to-large gains in reading and math
on both measures across all three grade spans. In other words, these five states showed
gains according to all of the indicators collected for this study. In reading alone, seven
states showed moderate-to-large increases across all three grade spans on both measures.
In math alone, nine states showed similar gains across all three grade spans on both measures.
The rest of the states had different trends at different grade spans.

• Elementary school math is the area in which the most states showed improvements. Of 
the 25 states with sufficient data, 22 demonstrated moderate-to-large math gains at the 
elementary level on both the percentage proficient and effect size measures, while none 
showed moderate or larger declines. Based on percentages proficient alone, 37 of the 41 
states with trend data in elementary math demonstrated moderate-to-large gains, while 
none showed moderate or larger declines. 

• More states showed declines in reading and math achievement at the high school level 
than at the elementary or middle school levels. Still, the number of states with test score 
gains in high school exceeded the number with declines. 



• Analyses of changes in achievement using effect sizes generally produced the same findings
as analyses using percentages proficient. But in some cases, the effect size analysis
showed a different trend. In Nevada, for instance, the percentage proficient in high
school math decreased, while the average test score increased. In New Jersey the percentage
proficient in middle school reading rose slightly, while the average test score dropped.

• When the percentage of students scoring at the proficient level on state tests is compared
with the percentage scoring at the basic level on the National Assessment of
Educational Progress (NAEP), states show more positive results on their own tests than
on NAEP. Moreover, the states with the greatest gains on their own tests were usually
not the same states that had the greatest gains on NAEP. The NAEP tests, however, are
not aligned with a state’s curriculum as state tests are, so NAEP should not be treated
as a “gold standard” to invalidate state test results but as an additional source of information
about achievement.
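
The “moderate-to-large” threshold cited in these findings (an average gain of at least 1 percentage point per year) can be illustrated with a minimal Python sketch. The function names, the way the average yearly gain is computed (total change divided by the number of years), the symmetric threshold for declines, and the sample percentages are all assumptions made for illustration; the study's own rules for analyzing data are covered in chapter 3.

```python
def average_yearly_gain(pct_by_year):
    """Average change per year in the percentage proficient.

    Assumption: total change between the first and last year with data,
    divided by the number of years between them.
    """
    years = sorted(pct_by_year)
    first, last = years[0], years[-1]
    return (pct_by_year[last] - pct_by_year[first]) / (last - first)


def classify_trend(pct_by_year, threshold=1.0):
    """Label a trend using the 1-point-per-year threshold from the text.

    Assumption: the same threshold is applied to declines.
    """
    gain = average_yearly_gain(pct_by_year)
    if gain >= threshold:
        return "moderate-to-large gain"
    if gain <= -threshold:
        return "moderate-to-large decline"
    return "slight change or flat"


# Hypothetical percentages proficient in one state, subject, and grade span.
example = {2002: 60.0, 2004: 63.5, 2006: 66.0}
print(average_yearly_gain(example))
print(classify_trend(example))
```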



Narrowing Achievement Gaps 

We analyzed trends in test score gaps for major racial-ethnic subgroups of students, low-income
students, students with disabilities, and limited-English-proficient (LEP) students.
We looked at both percentages proficient and effect size data where available; effect size data
were harder to come by for subgroups than for students overall. We considered a narrowing
or widening of the achievement gap to be a trend for a specific subgroup if it occurred in
the same subject (reading or math) across all three grade spans (elementary, middle, and high
school). We compiled trends from the 50 states to arrive at these and other national findings
(chapter 5):

• Among the states with sufficient data to discern trends by subgroup, the number of states
in which gaps in percentages proficient have narrowed since 2002 far exceeds the number
of states in which gaps widened.

• For the African-American subgroup, 14 of the 38 states with the necessary data showed 
evidence that gaps have narrowed in reading across all three grade spans analyzed, while 
no state had evidence that gaps have widened. In mathematics, 12 states showed these 
gaps narrowing, while only one state showed the gaps widening. Results were similar for 
the Hispanic and low-income subgroups. 

• As with the percentage proficient, the states in which effect size gaps have narrowed
outnumbered the states in which effect size gaps have widened. However, for states with
both types of data, there were a number of instances where gap closings in terms of
percentages proficient were not confirmed by effect size. Effect sizes seem to give a less rosy
picture of achievement gap trends.

• Even for subgroups that showed evidence of gaps narrowing, the gaps in percentages 
proficient often amounted to 20 percentage points or more, suggesting that it will take 
a concerted, long-term effort to close them. 
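
The rule described above for calling a gap change a trend (the same direction of change in one subject across all three grade spans) can be sketched in Python as follows. This is only an illustration: the function names, the definition of a gap as the comparison group's percentage proficient minus the subgroup's, and the sample figures are assumptions, not the study's data or code.

```python
GRADE_SPANS = ("elementary", "middle", "high")


def gap(comparison_pct, subgroup_pct):
    """Gap in percentages proficient (assumed: comparison group minus subgroup)."""
    return comparison_pct - subgroup_pct


def gap_trend(gaps_start, gaps_end):
    """Classify the gap trend for one subgroup in one subject.

    A trend is reported only when the gap moved in the same direction
    in all three grade spans.
    """
    changes = [gaps_end[span] - gaps_start[span] for span in GRADE_SPANS]
    if all(change < 0 for change in changes):
        return "narrowing"
    if all(change > 0 for change in changes):
        return "widening"
    return "no consistent trend"


# Hypothetical reading gaps (in percentage points) for one subgroup in one state.
gaps_2002 = {"elementary": 30.0, "middle": 32.0, "high": 28.0}
gaps_2006 = {"elementary": 24.0, "middle": 29.0, "high": 27.0}
print(gap_trend(gaps_2002, gaps_2006))
```

Note that a gap can narrow in every grade span and still be large, as in this example, where every gap remains above 20 percentage points.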







Gains Before and After NCLB 



Many states had reforms well underway before NCLB, so it is useful to know whether the 
pace of improvement has picked up since NCLB took effect (chapter 4). Only 13 states 
supplied enough years of data to make this determination — too few to know whether the 
findings for this sample represent a true national trend. In nine of these states, test results 
improved at a greater average yearly rate after 2002 than before. In the other four states, the 
pre-NCLB rate of gain outstripped the post-NCLB rate. 
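
The pre- and post-NCLB comparison described above amounts to comparing two average yearly rates of gain. A minimal Python sketch, under stated assumptions, is shown here: the function names and sample figures are hypothetical, 2002 is treated as the dividing year, and each rate is computed as total change divided by the number of years, which may differ from the study's exact procedure.

```python
def yearly_rate(pct_by_year, start, end):
    """Average change per year in percentage proficient between two years."""
    return (pct_by_year[end] - pct_by_year[start]) / (end - start)


def compare_pre_post(pct_by_year, nclb_year=2002):
    """Compare average yearly gains before and after the dividing year.

    Requires data for the dividing year itself, plus at least one earlier
    and one later year.
    """
    years = sorted(pct_by_year)
    pre = yearly_rate(pct_by_year, years[0], nclb_year)
    post = yearly_rate(pct_by_year, nclb_year, years[-1])
    if post > pre:
        verdict = "greater after 2002"
    elif post < pre:
        verdict = "greater before 2002"
    else:
        verdict = "same rate"
    return pre, post, verdict


# Hypothetical percentages proficient for one state.
state = {1999: 50.0, 2002: 53.0, 2006: 61.0}
print(compare_pre_post(state))
```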




Difficulty of Attributing Causes for Gains 

This report focuses on whether test scores have gone up since the enactment of NCLB. We
cannot say to what extent test scores have gone up because of NCLB (chapter 2). It is always
difficult to tease out a cause-and-effect relationship between test score trends and any specific
education policy or program. With all of the federal, state, and local reforms that have
been implemented simultaneously since 2002, it becomes nearly impossible to sort out
which policy or combination of policies is responsible for test score gains, and to what
degree. In a similar vein, this report does not take a position on how well specific components
of NCLB are working or whether the requirements in the current law are the most
effective means to raise achievement and close test score gaps.

One more caveat should be emphasized: test scores are not the same thing as achievement.
Although tests are often viewed as precise and objective, they are imperfect and incomplete
measures of how much students have learned. Still, state tests are the primary measure of
achievement used in NCLB and are the best available standardized measures of the curriculum
taught in classrooms.



Need for More Transparency in Test Data 

The No Child Left Behind Act requires states to report a massive amount of test data and 
attaches serious consequences to these data for districts, schools, and educators. But the data 
on which so much rests are not easy to access in some states and are sometimes inconsistent, 
outdated, or incomplete (chapter 7). Moreover, the data needed to calculate effect sizes or 
determine which subgroups were small or rapidly changing were unavailable in some states, 
even though these data are integral to all testing systems. Reasons for these shortcomings 
include overburdened state departments of education, ongoing corrections in test data, and 
technical or contractual issues with test contractors. These shortcomings are not necessarily 
the fault of state officials — who were generally cooperative in providing or verifying data 
when asked — but these problems complicated our efforts to reach definitive conclusions 
about student achievement. 

It took many months of effort to gather all the data needed for this study and have state
officials verify their accuracy. Our experience suggests how difficult it would be for the
average citizen to get information about test score trends in some states, and points to the
need for greater transparency in state test data. States could improve transparency by taking
the following steps:



• Posting test data in an easy-to-find place on state Web sites 

• Providing clear information and cautions about breaks in the comparability of test data 
caused by new tests or changes in testing systems 

• Reporting standard deviations, mean scale scores, numbers of test-takers, and other 
important information listed in chapter 7 



State-by-State Achievement Trends on the Web

The trends highlighted in this report have been drawn from an extensive set of data on each
state. Complete profiles of test results and other information for individual states can be
accessed on the CEP Web site at www.cep-dc.org/pubs/stateassessment. We encourage anyone
who is interested in trends for a specific state to visit the Web site and find that state’s profile.



Future Phases of This Study 

This report describes the findings from phase I of what will be a three-phase study of student
achievement. Phase II, which will be completed this summer, involves on-site interviews
with state officials in 22 states. Phase II investigates in more detail the trends
uncovered during phase I of the study and the factors that affect comparability or availability
of test data; it also reports information from state officials about how well specific
requirements of NCLB are working and how the law could be improved. Phase III, which
will be carried out in the fall and winter of 2007-08, examines student achievement at the
school district level in three states.








Chapter 2 

What Makes This Study Unique 



Key Findings 

• This study was designed to be the most comprehensive and thorough study to date on 
trends in state test scores since NCLB was enacted. Instead of just looking at test results 
in a limited number of states, the study analyzed results from all 50 states. And instead 
of taking for granted that the data reported on state Web sites were accurate, the study 
asked states to verify the accuracy of the test data collected. The process of gathering, 
verifying, and analyzing test results from all states turned out to be an arduous task that 
involved a great deal of cross-checking and depended on cooperation from state officials. 

• This study included other unique elements intended to address limitations of past 
research on achievement. To determine whether the rate of improvement has changed 
since NCLB was enacted, the study compared achievement trends before and after 2002. 
Within each state, the study omitted test results that were not comparable because the 
state had made changes to its testing program. Finally, the study used measures in addition 
to the percentages of students scoring proficient on state tests. 

• Although test scores have gone up since the enactment of NCLB, it is difficult to say 
whether or to what extent they have gone up because of NCLB. It is nearly impossible 
to isolate a cause-and-effect relationship between NCLB and test score trends when 
states, school districts, and schools have simultaneously implemented many different yet 
interconnected reforms. 

• Test scores are not synonymous with achievement. Tests are imperfect and incomplete
measures of how much students have learned. But for a wide-scale study of achievement,
test scores are still the best means available to draw inferences about student learning.



Background on the Study 

Since the No Child Left Behind Act was enacted more than five years ago, it has spurred as
many changes in elementary and secondary education as any federal policy in U.S. history.
Most states have revamped and expanded their testing and accountability systems, and some 
have created these systems where none existed before. Districts and schools have revised their 
curricula, expanded programs for struggling students, and reorganized instructional time to 
meet the law’s adequate yearly progress (AYP) requirements. Teachers have changed how 
they teach. And students continue to take more tests than ever. 







A great deal hinges on the state reading and mathematics tests that NCLB requires students 
to take yearly in grades 3 through 8 and once during high school. The results of these tests 
are used to determine whether a school or district must take serious steps to raise achievement 
because it has been identified for improvement under NCLB; whether students are eligible 
for subsidized tutoring or public school choice; and, if low performance persists, whether 
teachers and administrators will be replaced or whether a school will be dramatically reorganized, 
converted into a charter school, operated by an outside contractor, or taken over by the state. 

All of the sanctions in NCLB, and all of the changes brought about by the law, are aimed at 
accomplishing the same fundamental goal — to improve the academic achievement of all 
students, including those from historically underperforming subgroups. So in 2007, the year 
that NCLB is up for reauthorization by the Congress, the question that matters most is 
whether student achievement has gone up since the law took effect. 

This report is the first product of a three-phase study of student achievement before and
after NCLB. The study is being conducted by the Center on Education Policy, an independent
nonprofit organization. For phases I and II, CEP has received invaluable advice from a
panel of five nationally known experts in educational testing or policy research, and extraordinary
technical support from our contractor, the Human Resources Research Organization.




STUDY QUESTIONS, PURPOSES, AND DESIGN 

The study on which this report is based aims to answer two big questions, to the extent they
can be answered now:

1. Has student achievement in reading and math increased since No Child Left Behind
was enacted?

2. Have achievement gaps between different subgroups of students narrowed since No 
Child Left Behind was enacted? 

To explore these questions, CEP designed a three-phase study, with the help of the expert
panel and HumRRO on phases I and II. During phase I, which lasted 14 months, HumRRO
staff collected various types of test data and other information from every state. CEP and
HumRRO analyzed these data to determine trends in overall student test scores and achievement
gaps in states with sufficient data. Phase II of the study, which will be completed this
summer, involves on-site interviews with state officials in 22 states. The goal of these interviews
is to investigate further the trends uncovered during phase I, learn more about changes
in state testing systems and factors affecting availability of test data, and gather information
about how well specific requirements of NCLB are working and how they could be
improved. Phase III, which CEP has designed and will carry out in the fall and winter of
2007-08, examines student achievement at the school district level in three states.

This report has two main purposes, one informational and one educational. The first purpose 
is to document our findings from phase I in response to the two study questions. With this 
report we have put the most comprehensive and current data available about student test 
results in reading and math from all 50 states into the hands of policymakers to inform their 
discussions about reauthorization. A special benefit is the series of 50 online profiles, one for 
each state, developed by CEP and HumRRO to accompany the report. Each profile 
contains a rich store of test results and other information for that state. The profiles can be 
accessed on the CEP Web site at www.cep-dc.org/pubs/stateassessment. 



In this report, we have also provided our own analyses of the data, which we conducted as 
objectively as possible, with support from HumRRO and independent of any special interests. 

Although many states test other subjects, this study focuses on reading and math achievement
because these are the only two subjects that states are required to test for NCLB
accountability purposes. Although NCLB requires states to test science by school year 2006-07,
the science results are not used for accountability.

The second purpose of the report is to educate policymakers and others on what can and
cannot be known about student achievement, based on available data. With the reauthorization
of NCLB underway, people will use test scores to tell the story they want to tell.
Everyone interested in NCLB needs to be very careful about reaching conclusions based on
flawed or simplistic interpretations of data, or believing claims that go beyond what the data
can support. Positive trend lines in test results may indicate that students have learned more,
but they may also reflect easier tests, lower cut scores for proficiency, changing rules for testing,
or overly narrow teaching to the test. Similarly, flat-line results could signal no change
in achievement, or they could mean that the test is not a sensitive measure of the effectiveness
of the instruction students are receiving. And not all states have sufficient, comparable
data to allow valid conclusions to be drawn about trends in overall student achievement or
performance gaps before and after NCLB took effect.

In this climate, it is critical that policymakers and the public understand the quality and 
limitations of the available test data, the types of data that are not routinely available, and 
the factors that could distort trends in test results. To fulfill this educational purpose, we 
have included information about these issues in chapter 7. 




CEP’S EXPERIENCE WITH NCLB RESEARCH 

CEP is uniquely positioned to lead a study of student achievement since enactment of the 
No Child Left Behind Act. This special report on achievement trends represents a continuation — 
and in some ways the pinnacle — of a broader national study of federal, state, and local 
implementation of NCLB that CEP has been conducting since 2002. This broader study 
has been based on data from an annual survey of all 50 states, an annual survey of a nationally 
representative sample of between 274 and 349 responding school districts per year, and 
annual case studies of up to 43 school districts and 33 schools. 

Since 2002, we have issued annual reports on our findings from this broader work. This 
year, we are publishing separate reports addressing different aspects of NCLB. Several have 
been released and more will be published in the coming weeks. 1 This report on achievement 
is part of that set. All of our past and current NCLB reports are available at www.cep-dc.org. 

1 In 2007, CEP has already published reports on NCLB state accountability plans, school restructuring in California and Michigan,
state monitoring of supplemental educational services, and state capacity to administer NCLB. Forthcoming CEP reports on
NCLB will deal with teacher quality requirements, assistance to schools in improvement, curriculum and instructional changes,
school restructuring in Maryland, and Reading First.

Since 2004, we have included questions about achievement in our state and district surveys.
In separate questions about language arts and math, we asked state and district officials
whether student achievement was improving, declining, or staying the same, based on the
assessments used for NCLB. We also asked a series of questions about achievement gaps for
specific subgroups of students. Whenever we have asked these achievement questions, the
majority of state and district officials have responded that student achievement is improving
and gaps are narrowing. In this year’s state survey, for example, 25 states reported that student
achievement was improving in language arts, based on test results from 2005-06; 15 states
said that achievement in this subject was staying the same, and 3 reported that it was declining.
In math, 27 states reported improvements in achievement, 11 reported flat achievement,
and 4 reported declines (CEP, 2007a). Compared with last year’s survey responses,
fewer states reported improvements this year and more reported flat achievement.

These survey results were based on self-reports and had other limitations. Recognizing these 
limitations, we designed this achievement study, which builds on our prior NCLB research 
but goes well beyond it by examining test scores directly. 




ROLES OF THE EXPERT PANEL AND HUMRRO 

To develop a sound methodology and provide expert advice at all stages of the study, CEP 
assembled a panel of some of the nation’s top scholars on education testing and education 
policy issues. Panel members included the following: 

• Laura Hamilton, senior behavioral scientist, RAND Corporation

• Eric Hanushek, senior fellow, Hoover Institution

• Frederick Hess, director of education policy studies, American Enterprise Institute 

• Robert L. Linn, professor emeritus, University of Colorado 

• W. James Popham, professor emeritus, University of California, Los Angeles 

Jack Jennings, CEP’s president and CEO, chaired the panel. The panel met four times in
Washington, D.C.: March 2006, September 2006, January 2007, and April 2007. At these
meetings, the panel developed the study methodology, reviewed the initial state data, developed
procedures for analyzing state data, and reviewed drafts of this report. CEP staff and consultants
and HumRRO staff also attended the panel meetings. In addition, the panel held
formal telephone conferences, reviewed study documents, and provided informal advice by
e-mail and phone. The panel members’ wealth of knowledge has contributed immeasurably
to the quality of this study.

CEP contracted with HumRRO, a nonprofit research organization with considerable experience 
in evaluating testing systems and standards-based reforms, to collect, compile, and vet the 
quality of the enormous amount of data required for this achievement study. HumRRO also 
did the initial analysis of trends in each state. CEP is deeply grateful to the HumRRO staff, 
whose tireless and capable efforts have been essential to this study. 

Although the panel members and HumRRO staff reviewed all drafts of this report, we did
not ask them to endorse it, so the findings and views expressed here are those of CEP.



Unique Features of This Study 



From the first meeting of the expert panel, we set out to design the most comprehensive and 
thorough study of state test scores since the passage of NCLB. We wanted the study to be 
methodologically sound, feasible, and useful to policymakers and the public, and to build 
on previous research on this issue, including CEP’s past research. We believe this study has 
met those objectives but, as explained below, it involved a much more intensive effort than 
we initially realized. 



COMPREHENSIVE SCOPE 

A study of test score trends that focused on a limited number of states would immediately
raise concerns about how the states were selected and whether conclusions were biased.
Therefore, this study aimed to include various kinds of test results from as many states as possible.
Ultimately, we obtained data from all 50 states. The effort involved, however, made us appreciate
why this type of study has not been done before and why it cannot be easily replicated.

Most other studies of NCLB-related achievement use published state or local data on the percentages
of students scoring at the proficient level on state tests, and take for granted that
these data are accurate. Early in our research, however, we found holes and discrepancies in
the published data, described in detail in chapter 7. Therefore, our study made an extra effort
to have states verify the accuracy of the test data we gathered. Using the process outlined in
the appendix, we sent state officials a complete file of the data we had collected for their state
and asked them to check their accuracy, make corrections, and fill in missing information.
State officials were also asked to sign a verification checklist. This verification process was long
and complicated, as noted in chapter 7; most states did make modifications in their data.

A study of this breadth and depth would not have been possible without the cooperation of 
the states. We appreciate greatly the considerable efforts made by officials from individual 
states and from the Council of Chief State School Officers to voluntarily provide us with 
and verify their data. 

Analyzing the data we collected was also a complicated and time-consuming process. In a typical
state, the data tables developed by HumRRO included as many as 16,184 individual numbers,
which HumRRO staff and a CEP consultant scrutinized to determine achievement trends.

The data used to arrive at the findings in this report represent the best information we could 
obtain by the mid-January collection deadline for phase I of the study. Still, the information 
in this report and the accompanying state profiles represents a snapshot in time. 




OTHER UNIQUE FEATURES OF THE STUDY 

Working together, CEP, the expert panel, and HumRRO designed the phase I study to 
include the following unique elements: 

• State-centered approach. Because each state has its own assessment system, aligned to 
different content standards and using different definitions of proficiency, it can be perilous 
to combine test results across states when analyzing achievement trends. Still, state tests 
are the main measure of achievement under NCLB and the best available standardized 
measures of the curriculum being taught in classrooms. This study makes separate judgments 
about achievement in each state, based on that state’s test data, and then summarizes 
those judgments across states to arrive at a national picture. 







• Pre- and post-NCLB trends. Many states had gotten serious about education reform 
years before NCLB took effect. A study that did not look at trends before NCLB would 
raise questions about whether gains in achievement after NCLB were just a continuation 
of previous trends. Furthermore, a study that did not include results from the most 
recent round of state testing would not accurately reflect current progress in achieve- 
ment. To the extent possible, this study looks at test score trends before and after 2002, 
the year NCLB was enacted, to determine whether the trends have changed direction 
and whether the pace of improvement has sped up or slowed down since NCLB. On the 
advice of the expert panel, we tried to obtain test data from 1999 through 2006, or for 
whichever of these years states had data available. In nearly all states, the most recent data 
available during phase I of this study were from tests administered in 2005-06. 

• Breaks in data. Often the test results reported for NCLB are treated as one long, uninter- 
rupted line of comparable data. But as explained in chapter 3, many states adopted changes 
in their assessment systems since 2002 that created a “break” in test data — meaning that 
results after the change are not comparable to results before the change. This study only ana- 
lyzed test results for years with an unbroken trend of comparable data. 

• Additional analyses beyond percentages proficient. To make judgments about student 
achievement, NCLB relies mainly on a single indicator, the percentage of students scor- 
ing at or above the proficient level on state tests. Like every measure of achievement, this 
one has its limitations. As explained in chapter 3, we supplemented our analysis of per- 
centages proficient, where possible, with rigorous alternative analyses based on effect 
sizes, which are derived from mean, or average, test scores. (Definitions of these and 
other technical terms can be found in the glossary at the end of this report.) 
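
As a rough illustration of the last point, an effect size expresses a change in mean scores in standard deviation units. The sketch below is only one common form of the calculation (change in mean divided by the average of the two standard deviations). The function name and the sample numbers are hypothetical, and the formula is an assumption rather than the exact computation used in this study.

```python
def effect_size(mean_before, sd_before, mean_after, sd_after):
    """Change in mean score expressed in standard deviation units.

    Assumed form: difference in means divided by the average of the
    two standard deviations. Other definitions (e.g. pooled SD) exist.
    """
    avg_sd = (sd_before + sd_after) / 2
    return (mean_after - mean_before) / avg_sd


# Hypothetical state: mean scale score rises from 250 to 254, SD stays at 40.
gain = effect_size(250.0, 40.0, 254.0, 40.0)
print(f"Effect size: {gain:.2f} standard deviations")
```

Because the result is in standard deviation units rather than scale score points, it does not depend on where a state places its proficiency cut score.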



Limitations of This Study 

Even with the steps described above, this study makes judgments about student achievement 
based on less than perfect information. In addition to the test construction issues discussed 
in chapter 3, two broader types of limitations, described below, are particularly noteworthy: 

DIFFICULTY OF ATTRIBUTING CHANGES TO NCLB 

This report focuses on whether student achievement has improved since the enactment of 
NCLB. It is very difficult to determine whether students are learning more because of 
NCLB. Isolating the cause-and-effect relationship of any education policy is often imprac- 
ticable. With a policy as far-reaching as NCLB, it becomes nearly impossible when states, 
districts, and schools are simultaneously implementing so many different yet interconnected 
policies and programs. If student achievement has risen since NCLB took effect, is this due 
to NCLB, or to state or local reforms implemented at roughly the same time, or to both? If 
both, how much of the improvement is attributable to state or local policies and how much 
to federal policies? Using multiple methods of analyzing achievement will not tease out the 
causes of gains or declines. 

In a similar vein, this study does not take a position on how well specific components of 
NCLB are working or whether the requirements in the current law are the most effective 
means to raise achievement and close test score gaps. 



AN IMPERFECT MEASURE OF ACHIEVEMENT 



Like virtually all studies of achievement, this one relies on test scores as the primary measure 
of how much students are learning. But test scores are not synonymous with achievement. 
Although tests are often viewed as precise and very objective, they are imperfect and incomplete 
measures of learning. Only certain types of knowledge and skills get tested on large-scale 
state tests — generally those that can be assessed with questions that are easy to administer 
and score. 

In addition, test scores can go up over time without actually indicating that students have 
learned more; for example, several researchers have observed a “bump” in scores in the first 
few years after a test has been introduced, as students and teachers become more familiar 
with its format and general content (Hamilton, 2003; Linn, Graue & Sanders, 1990; 
Koretz, 2005). Moreover, tests vary in their instructional sensitivity — in other words, how 
well they detect improvements due to better teaching (Popham, 2006). 

Still, tests are the best means available to draw inferences about student learning, especially 
across schools, districts, and states. That is why test results, despite their limitations, are the 
focus of this study. 








Chapter 3 

Achievement Measures Used in This Study 



Key Findings 

• The percentage of students scoring at the proficient level on state tests — the “magic 
number” used for NCLB accountability and the only measure of achievement the Act 
requires states to collect and report — may appear to be accurate, objective, and consis- 
tent, but in some cases it can be misleading. Definitions of “proficient” performance vary 
so much from state to state that the percentage proficient should not be used for com- 
parisons between states. Even within the same state, percentages proficient may not be 
comparable from year to year due to federal and state policy changes. Moreover, the per- 
centage proficient provides no information about progress in student achievement that 
occurs below or above the proficient level. 

• In this study, we used the percentage proficient as one measure of achievement. To 
address its limitations, however, we used a statistical tool called effect size as a second 
measure. Because effect sizes are based on mean, or average, test scores in conjunction 
with the dispersion of scores, they capture changes in achievement below and above the 
proficient level. They also avoid a problem that arises when percentages proficient are 
used to analyze achievement gaps for student subgroups — namely, that the gaps between 
higher- and lower- achieving subgroups can look different, depending on where a state 
has set its cut score for proficient performance on the scoring scale for the test. 

• This study used a set of rules, applied consistently across all states, to determine such 
issues as when breaks had occurred in the comparability of test data, when subgroup 
scores should be approached with caution, and what constitutes a trend in achievement. 

• Even if an ideal amount of test score data had been available for every state, policymakers 
and others should still be cautious when interpreting test score trends because of the 
many ways that a state’s test can change from year to year. There is a certain degree of 
“fuzziness” or potential distortion in state test results that is derived from the tests 
themselves and from the way they are created, administered, and scored. 



Complexities of Measuring Achievement 

Measuring student achievement is a much more complex proposition than measuring a child’s 
height with a yardstick. Although the tests used for accountability under the No Child Left 
Behind Act are a logical starting point for a study of achievement since the law took effect, 
there are different ways of looking at test data and different ways of defining improvement. 







In this chapter, we review the limitations of the primary measure of achievement used by 
NCLB — the percentage of students scoring at or above the proficient level on state tests. We 
describe how we addressed many of these limitations by taking into account “breaks” in 
comparable data, supplementing the percentage proficient with additional measures, and 
flagging data that should be approached with caution. In addition, we explain the rules devel- 
oped by CEP, HumRRO, and the expert panel for deciding which specific data to include in 
our analysis and determining whether improvement has occurred. We add some cautions 
about why test results — even if carefully analyzed in multiple ways — may still not provide a 
completely accurate picture of student performance trends. We conclude with a list of the 
detailed information available in the state-by-state profiles posted on the CEP Web site. 




Limitations of Percentages Proficient 

The main measure used to gauge student achievement under the No Child Left Behind Act 
is the percentage of students scoring at or above the proficient level on state tests. The federal 
law does not define proficient performance. Instead, NCLB directs each state to set its own 
proficiency standard and measure student progress toward it with its own tests. 
Consequently, “proficient” means different things in different states. States vary widely in 
curriculum, learning expectations, and tests, and they have defined proficiency in various 
ways, using different cut scores. (The cut score is the score students must meet or exceed on 
a test to be counted as proficient.) States have also developed different rules for how to 
calculate the percentage proficient. 

Even with this variety, the percentage proficient is NCLB’s magic number — it determines 
whether schools and districts are making adequate yearly progress toward the goal of 100% 
proficiency in 2014 or whether they are “in need of improvement.” It is also the only form 
of test data that states are required by the Act to collect and report, which they must do for 
the state as a whole and for school districts, schools, and subgroups of students. 

On one hand, the percentage proficient measure has the advantage of being easily understood 
by policymakers, the media, parents, and the public. It also addresses the concern that large 
numbers of students are not achieving at an adequate level by giving a snapshot of how many 
students have met the performance expectations set by their state. On the other hand, the 
percentage proficient has limitations as a measure of whether student achievement has 
increased. People assume that this measure is accurate, objective, and consistent, but in reality 
it can sometimes be misleading. Three limitations of the percentage proficient are particularly 
problematic in studies of achievement trends over time: a lack of comparability within the 
same state, a lack of comparability across states, and omission of progress above and below 
the proficient level. 



THE PROBLEM OF COMPARABILITY WITHIN STATES 

Since NCLB was first enacted, states have made policy changes over the years that have 
affected calculations of the percentage proficient. Although these changes have been made 
with the approval of the U.S. Department of Education (ED), they can influence the com- 
parability of percentages proficient from one year to the next in the same state. In essence, 
certain changes have made it easier for some students to be deemed proficient even if they 
haven’t learned more. As a consequence, 62% proficient in 2006 may not mean the same 
thing as it did in 2002. Similarly, an increase from 62% to 72% proficient between 2002 
and 2006 does not necessarily mean that students’ raw test scores have gone up a proportionate 



amount or that students have learned that much more. Rather, an indeterminate amount of 
this increase may be due to policy changes, including some of the changes described in depth 
in a recent CEP report on state accountability plans (CEP, 2007c). 

One notable state change that affects the percentage proficient involves retesting — students 
retaking a state test (typically a different form of the same test) that they had not passed the 
first time. Initially, ED held to the “first administration” rule for tests used for NCLB — the 
score that counted was the one a student earned the first time the test was taken. Many 
states, particularly those with high school exit exams, allow students multiple opportunities 
to pass, which conflicted with ED’s rule. In 2004, ED revised its policy and began permit- 
ting states to count proficient scores on retests for AYP purposes, and to “bank” the scores 
of students who pass the exams early and count these scores as proficient in a subsequent year. 

In another relevant policy change, a few states put a “standard error of measurement” of plus 
or minus a few points around an individual student’s test score.² This practice is intended to 
address measurement error that occurs in test scores due to differences in the sample of ques- 
tions that appear on different forms of the same test, student guessing, students’ physical con- 
dition or state of mind, distractions during testing, less than perfect agreement among raters 
who score written responses to open-ended test questions, and other factors unrelated to stu- 
dent learning. In states that use this type of standard error, some students are counted as pro- 
ficient even though their scores fall slightly below the proficiency cut score. This has the 
effect of inflating the percentage proficient figure. 
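
The inflating effect of a standard error band can be seen in a small, purely illustrative calculation. The scores, the cut score of 300, and the 3-point band below are invented for the example, and the rule used (adding the band to each score before comparing it with the cut score) is an assumed simplification of what states actually do.

```python
def percent_proficient(scores, cut_score, sem_band=0.0):
    """Percentage of students counted as proficient.

    With sem_band > 0, a student whose score falls within the band
    just below the cut score is still counted as proficient.
    """
    counted = sum(1 for s in scores if s + sem_band >= cut_score)
    return 100.0 * counted / len(scores)


# Hypothetical scale scores for ten students; cut score of 300.
scores = [280, 285, 290, 296, 298, 299, 300, 305, 310, 320]

strict = percent_proficient(scores, cut_score=300)
banded = percent_proficient(scores, cut_score=300, sem_band=3)
print(f"Without band: {strict:.0f}%  With 3-point band: {banded:.0f}%")
```

No student's score changes between the two calculations; only the counting rule does.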

Changes in federal regulations and guidance have also had an impact on percentage proficient 
calculations. Most notably, ED issued major rule changes that affected which students with 
disabilities and limited-English-proficient (LEP) students are tested for NCLB accountability 
purposes, how they are tested, and when their test scores are counted as proficient under 
NCLB (see box A). Ultimately, these adjustments have made it easier for some students in 
these subgroups to be counted as proficient, which in turn has affected the comparability of 
test results for these subgroups over time. The impact has been significant enough to make 
it inadvisable to draw comparisons of the performance of these two subgroups between 
2002 and 2006. 




The comparability of the percentage proficient measure within the same state can also be 
affected by significant shifts in subgroup demographics and numbers. Many states have 
experienced rapid growth in the Hispanic and LEP subgroups. For example, in just two 
years (2004 to 2006), Tennessee saw a 46% increase in the number of students in the 
Hispanic subgroup in 4th grade, as measured by the number of students taking the state 
reading test. 

Rapid changes in the number of students tested can affect achievement trends in ways that do 
not reflect the effectiveness of an educational system, complicating efforts to determine trends 
across years for the same subgroup or to compare trends in gaps between different subgroups. 



² Many more states use a related method referred to as a “confidence interval,” which puts a band of plus or minus a few 
percentage points around a school’s or subgroup’s percentage proficient for the purpose of determining adequate yearly progress. 
However, this technique does not affect the percentage proficient that is reported at the state level. For more on confidence 
intervals, see CEP, 2005. 






Box A. Students with Disabilities and Limited-English-Proficient Students: A Moving Target 




Since the No Child Left Behind Act took effect, states and school districts have encountered continuous 
problems in attempting to square the law’s testing and accountability requirements with the unique 
needs of students with disabilities and limited-English-proficient students. In response, the U.S. 
Department of Education made several policy changes to accommodate these subgroups while still 
holding states, districts, and schools accountable for these students’ performance in the same way as 
other subgroups. Over the past few years, these policy decisions have affected which students are 
counted in these subgroups, which students are tested, how they are tested, and how their test scores 
are counted. 

Before NCLB, it was not uncommon for students with disabilities and LEP students to be exempted from 
standardized testing altogether or given different tests than other students (National Research Council, 
1997). NCLB included a requirement for students in these two subgroups to be tested with the same 
tests and standards as other students, but in the early years of the law, some school districts were 
unsure how to implement this requirement, and states and districts had various policies for how to test 
students in these subgroups (CEP, 2003; 2004). Some districts gave the regular state test with no 
modifications, which made it difficult for students with cognitive or learning disabilities to score at the 
proficient level. Other districts made liberal use of test accommodations or modifications and tested 
some students with disabilities with assessments geared to their learning level (alternate standards) 
rather than their grade level— practices that likely helped some students reach a proficient score. 

Experience has shown that it is very difficult for the subgroup of students with disabilities to score at the 
proficient level on regular state tests. In 2003, ED issued regulations that allowed states to give students 
with significant cognitive disabilities an alternate assessment geared to alternate standards. However, 
the number of scores from these alternate assessments that were counted as proficient for AYP purposes 
could not exceed 1% of all tested students. Another policy change in April 2005 expanded the 
opportunities for students with disabilities to take alternate assessments by allowing additional 
students to be tested against “modified” standards, with a cap of 2% of all students. The modified tests 
allowed more students with disabilities (and to a lesser extent, all students) to be counted as proficient, 
but it also meant that the percentages proficient for the disabilities subgroup would not be truly 
comparable from year to year and would not be a reliable measure of achievement trends. 

Federal policy changes have also affected the comparability of test results for LEP students. Under ED’s 
initial, strict interpretation of NCLB, this subgroup as a whole could not, by definition, achieve 
proficiency in English because once a student became proficient in the English language, he or she was 
moved out of the LEP subgroup, and those remaining were not proficient. In February 2004, ED issued a 
policy allowing states to exempt immigrant students in their first year of enrollment in a U.S. school from 
taking the regular state English language arts test. States could also include former LEP students in the 
LEP subgroup for two years after they reached English proficiency, which of course increased the 
percentage of LEP students scoring proficient in reading or English language arts. But again, this policy 
change means that the percentage proficient for this subgroup would not be a reliable indicator of trends 
in achievement. 

For these reasons, we do not include trends for the students with disabilities and LEP subgroups in the 
national summary on achievement gaps in chapter 5. We do, however, report performance for these 
subgroups within the individual state profiles. 

Source: Center on Education Policy, 2007c. 




THE PROBLEM OF COMPARABILITY BETWEEN STATES 



Because the definition of “proficient” varies so much from state to state, it is inadvisable to 
use the percentage proficient measure to compare student achievement in one state with 
student achievement in another. A seemingly identical outcome — say, 75% of high school 
students scoring proficient on a math test — will mean two different things in two different 
states in terms of what students actually know and can do in math. For this reason, we have 
avoided making these sorts of state-to-state comparisons in this report, and we strongly urge 
others to avoid doing so. 

Many of the policy changes described above that affect the comparability of the percentage 
proficient measure within the same state also affect its comparability between states. 
Between-state comparisons are further confounded by decisions about where to locate the 
cut score on the scoring scale for a particular test. States can set a low cut score or a high 
one, so that more students or fewer students are deemed proficient, and states have made 
very different choices. In Tennessee, 88% of 3rd grade students reached the proficient level 
in math last year; in Hawaii, the figure was only 30% proficient. It is unlikely that there are 
such huge discrepancies in student achievement between these states. It is more likely that 
these results largely reflect differences in the difficulties of the tests and the location of 
proficiency cut scores. 

The location of the cut score also creates problems in comparing trends over time in the 
percentage proficient across states. This is because the same percentage point increase 
means different things at different points on the score distribution. A 10-percentage-point 
increase in the percentage proficient is much more difficult to achieve when it involves an 
improvement from 85% to 95% than when it involves a gain from 50% to 60%. Moreover, the 
location of the proficiency cut score can affect how large achievement gaps between subgroups 
appear to be, and can make it difficult to accurately compare progress in narrowing these 
gaps (see box B). 
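
One way to see why the same percentage point gain is harder near the top of the distribution is to ask how far the average score would have to move in each case. The sketch below assumes test scores are normally distributed with a constant spread, which is an idealization for illustration and not a claim about any particular state test; the function name is hypothetical.

```python
from statistics import NormalDist


def shift_needed(pct_before, pct_after):
    """Rise in the mean score, in standard deviation units, needed to move
    the percentage at or above a fixed cut score from pct_before to pct_after.

    Assumes normally distributed scores with an unchanged standard deviation.
    """
    z = NormalDist().inv_cdf  # standard normal quantile function
    return z(pct_after / 100) - z(pct_before / 100)


middle = shift_needed(50, 60)  # gain near the middle of the distribution
top = shift_needed(85, 95)     # same-sized gain near the top
print(f"50% -> 60%: {middle:.2f} SD   85% -> 95%: {top:.2f} SD")
```

Under these assumptions, moving from 85% to 95% proficient requires more than twice the improvement in average scores that moving from 50% to 60% does.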




THE PROBLEM OF ONE LEVEL OF ACHIEVEMENT 

A persistent criticism of the percentage proficient measure raised by educators is that it provides 
a picture of student test performance that is limited to just one level of achievement — 
proficient — and provides no information about achievement above or below that level. 

For example, in a school with large numbers of low-performing students, teachers and 
administrators may be working very hard to improve achievement and may be making 
progress in boosting students from the “below basic” to the “basic” levels but raising fewer 
students to the higher level of “proficient.” It is possible for test scores to increase without 
that increase being reflected in the percentage proficient if the increase occurs below the 
proficient level. Despite progress at the lower achievement levels and increasing test scores, 
a school or district would fail to make adequate yearly progress under NCLB and would be 
subject to the law’s sanctions. Similarly, schools do not receive credit for gains by students 
who are already performing at or above the proficient level. In response to this problem, ED 
has recently allowed states to experiment with “growth models” to calculate adequate yearly 
progress, but only a few states have received permission to use these methods so far. 
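
A small invented example makes the point concrete. In the hypothetical school below, every student who started below the cut score gains 20 points, yet the percentage proficient does not move. The scores and the cut score of 300 are made up for illustration.

```python
from statistics import mean

CUT_SCORE = 300  # hypothetical proficiency cut score


def pct_proficient(scores, cut=CUT_SCORE):
    """Percentage of scores at or above the cut score."""
    return 100.0 * sum(s >= cut for s in scores) / len(scores)


# Hypothetical scale scores for the same six students in two years.
year1 = [240, 250, 260, 270, 310, 320]
year2 = [260, 270, 280, 290, 310, 320]  # all gains occur below the cut score

for label, scores in (("Year 1", year1), ("Year 2", year2)):
    print(f"{label}: mean = {mean(scores):.1f}, "
          f"proficient = {pct_proficient(scores):.1f}%")
```

The mean score rises while the percentage proficient stays the same, so a measure based only on the percentage proficient would report no progress.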






Box B. Cut Scores and Gaps 




Some states appear to have shown more progress than others in narrowing gaps in percentages proficient 
between, for example, African American and white students. Can one conclude that educational practices 
aimed at reducing these gaps are working better in one state than another? Not necessarily. 

Changes in instruction can affect the size of achievement gaps, but so can other factors. An important issue 
to consider when looking at achievement gaps is the location of the proficiency cut score— that is, the test 
score students must reach or exceed to be considered “proficient” on a test. Research shows that where the 
proficiency cut score is set makes a difference in the apparent size of the gap. If a proficiency cut score is very 
high or low, so that almost everyone reaches the cut score or almost nobody reaches it, there is little 
apparent gap. A cut score closer to the mean test score will be more sensitive to the achievement gap. 

This was illustrated graphically by Paul Holland (2002). Figure 1 shows the results of the 2000 
administration of the math portion of the National Assessment of Educational Progress for 8th grade 
African American and white students. The test was scored on a scale of 0-500, with the cut score for the 
“basic” level of achievement set at 262, “proficient” at 299, and “advanced” at 333. 

The graph shows the percentage of students in each group that scored at or below a certain level on 
NAEP. The x axis is the score, and the y axis is the percentage of students achieving at or below that 
score. So, about 25% of white students scored at or below 262 (basic)— marked with a dashed vertical 
line in figure 1— while 75% exceeded this score. About 70% of African American students scored at or 
below 262, while about 30% exceeded this score. Therefore, at the basic level, the achievement gap 
between African American and white students is about 45 percentage points— quite large. 

Figure 1. African American/White Achievement Gap 
NAEP Mathematics 2000 Grade 8 




However, the achievement gap picture changes as one moves along the score scale. At the “proficient” 
level of 299— marked with a solid vertical line in figure 1— the black/white gap shrinks to about 30 
percentage points. As one moves toward the advanced cut score of 333 (shown in figure 1 as a dotted 
vertical line), the gap continues to shrink until it reaches about 6 percentage points at the advanced 
level. The same is true at the low end of the scale, where the gap is also a lot smaller. 

This NAEP illustration shows that focusing on a cut score of 262, 299, or 333 will have a dramatic impact 
on the apparent size of the achievement gap between African American and white students. The gap is 
larger at the middle of the NAEP score scale than at the extremes. 




Box B. (continued) 



Another way of illustrating this phenomenon is provided in figure 2, which consists of two normal 
distributions of test scores for two subgroups of students, subgroup A and subgroup B. A normal curve 
graphically represents the typical way that students’ scores are distributed on a test. Most scores are fairly 
close to the middle or average, and fewer scores are at the very low or very high ends. A hypothetical 
example is displayed in figure 2: the initial cut score (cut score 1) is set so that 84% of the students in 
subgroup A score at or above the cut score, compared with 50% of the students in subgroup B. (The areas 
to the right of the cut score under both curves represent the students who pass.) Therefore, the gap in 
percentages proficient between the two groups is 34 percentage points. 

If a state were to set an easier cut score, represented by cut score 2 in figure 3, more students would 
meet or exceed it. At that point, 98% of subgroup A students and 84% of subgroup B students would 
pass, and the achievement gap would be reduced to 14 percentage points. 

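Figure 2. Size of Gaps in Percentages Proficient with a Cut Score at the Mean 
[Figure: normal score distributions for subgroup A and subgroup B, with cut score 1 marked.] 

Figure 3. Size of Gaps in Percentages Proficient with a Lower Cut Score 
[Figure: the same distributions for subgroup A and subgroup B, with the lower cut score 2 marked.] 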



Therefore, anyone examining achievement gaps must take into account the location of the proficiency 
cut score. If a cut score is very high or low, there is little apparent gap. A cut score closer to the mean 
test score is a more sensitive measure of the achievement gap. In addition, discussions of changes in 
achievement gaps ideally should take into account any possible changes in cut scores. 

Source: Holland (2002), and Center on Education Policy. 
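
The hypothetical example in box B can be reproduced numerically. The sketch below assumes two normal score distributions with equal spread whose means are one standard deviation apart, which is what the 84% and 50% figures imply; the variable and function names are illustrative only.

```python
from statistics import NormalDist

# Scores are expressed in standard deviation units relative to subgroup B's mean.
# Assumption: both subgroups are normally distributed with the same spread,
# and subgroup A's mean is one standard deviation higher.
subgroup_a = NormalDist(mu=1.0, sigma=1.0)
subgroup_b = NormalDist(mu=0.0, sigma=1.0)


def gap_at(cut_score):
    """Gap in percentages proficient (A minus B) for a given cut score."""
    pass_a = 1 - subgroup_a.cdf(cut_score)
    pass_b = 1 - subgroup_b.cdf(cut_score)
    return 100 * (pass_a - pass_b)


# Cut score 1 sits at subgroup B's mean; cut score 2 is one SD lower.
for label, cut in (("very low", -3), ("cut score 2", -1),
                   ("cut score 1", 0), ("very high", 4)):
    print(f"{label:>12}: gap = {gap_at(cut):.0f} percentage points")
```

The underlying difference between the two subgroups is identical in every row; only the location of the cut score changes the apparent gap.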






How We Addressed the Limitations of the Percentage Proficient 



Our analyses began with the percentage proficient because this is the primary measure used 
to track progress under NCLB and is available from every state. Then, to address the 
limitations of the percentage proficient measure described above and other factors that 
could skew achievement trends, we took two additional steps: carefully identifying breaks in 
testing programs that limit trend analyses, and analyzing student achievement using different 
measures. These other measures by themselves are not perfect either. They cannot take into 
account differences between states in the standards and tests. They do provide us, however, 
with alternatives or additional data points that can support or contradict percentage proficient 
trends. In some cases, limited comparisons can be made between states using these alter- 
native measures. 




IDENTIFYING BREAKS IN TESTING PROGRAMS 

Many states have changed their testing systems since 1999 as a result of NCLB and/or their 
own state policies. Often these changes create a break in the test data that makes it inadvisable 
to compare test results before and after the change. For instance, if a state introduced a new 
test in 2004, the results should not be compared with results from the old test given in 2003 
(unless special equating procedures were conducted). Similarly, when a state introduces new 
state content standards that outline what students are expected to learn at different grades, 
usually the state must also redesign its testing program to ensure the tests are aligned with 
the new standards; this situation also results in a break in comparability. Chapter 7 describes 
the specific reasons we found for breaks in data. 

Major changes, such as the adoption of a new test, are usually announced and explained to 
the public. But not all changes are publicized. Sometimes states change their cut scores, 
including the proficient score for NCLB purposes — a process that may or may not be done 
quietly. Once new cut scores are set, the percentage proficient results cannot responsibly be 
compared with those from earlier years. (Mean scale scores would still be comparable if the 
tests themselves had not been changed in other ways). 

There are many educationally sound reasons why states make changes to their testing 
programs, such as better aligning tests with the curriculum taught in classrooms. 
Nevertheless, this situation makes it difficult or even impossible to track long-term trends in 
achievement. Ideally, the best data on trends come from states that had the same (or 
equated) assessments in place through all or most of the period from 1999 to 2006. 

To determine whether states made alterations to their testing programs that could affect the 
comparability of test results during our period of analysis (1999 through 2006), we collected 
various descriptive information from each state, including major changes in testing programs. 
(The specific information collected is listed in the appendix.) Data from a state that intro- 
duced a new test or changed its proficiency cut score had to be examined closely, because 
often these data were not suitable for trend analyses. Identifying breaks in testing programs 
helped to address the problem described above of year-to-year incompatibility of test results 
in the same state. After identifying the breaks, we limited our analysis to those years that had 
an unbroken line of comparable test results. 
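
As a rough illustration of this step (a sketch, not the study's actual procedure), the years of data for one state can be split into runs of comparable results wherever a break occurs. The function name and the input format, a set holding the first year in which a new test or cut score applied, are assumptions made for the example.

```python
def comparable_segments(years, first_years_of_change):
    """Split tested years into runs with no break in comparability.

    `first_years_of_change` holds the first year in which a new test,
    scale, or cut score applied (a hypothetical input format).
    """
    segments = []
    current = []
    for year in sorted(years):
        # A change year starts a new segment: results before it are
        # not comparable with results from that year onward.
        if year in first_years_of_change and current:
            segments.append(current)
            current = []
        current.append(year)
    if current:
        segments.append(current)
    return segments


# Hypothetical state that introduced a new test in 2003.
segments = comparable_segments(range(1999, 2007), {2003})
print(segments)  # [[1999, 2000, 2001, 2002], [2003, 2004, 2005, 2006]]
```

Trend analyses would then be run within a single segment, never across the boundary between two segments.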



COLLECTING MEANS AND STANDARD DEVIATIONS 



In addition to gathering data on percentages proficient, we also collected mean scale scores 
and standard deviations, explained below. These indicators give a more complete picture of 
the entire distribution of student performance, including performance above and below the 
proficient level. Examining these indicators helped us address the differences in state definitions 
of proficient performance and capture improvements across the entire range of high- and 
low-performing students. 

Mean test scores provide a different perspective on student performance than the percentage 
proficient. The mean is the same as an “average” and is figured by adding up all the individual 
test scores in a group and dividing them by the number of people in the group. Mean test scores 
are expressed on an interval (numerical) scale and permit more rigorous quantitative analysis 
than a simple determination of whether a student falls into the proficient or not proficient cat- 
egory. Mean test scores also provide a more accurate measure of achievement gaps because, as 
explained in box B, the size of the gap depends highly on where the proficiency cut score is set. 

When considered along with the percentage proficient, means provide additional information 
about the overall distribution of test scores. Consider a situation in which the percentage of stu- 
dents scoring at or above the proficient level in a particular state remains at 40% in both 2005 
and 2006, suggesting no improvement. However, the state’s mean test score might have gone up 
during that same period if students who were performing above the proficiency cutoff score 
achieved higher scale scores on the test in 2006. Or, to present another scenario, the mean could 
also rise if many students who scored below the proficient level earned higher scale scores but not 
enough to reach proficiency. In either case, the mean score might show progress not captured by 
the percentage proficient measure. Using mean scores also removes the uncertainty about 
comparability that arises when proficiency cut scores change. However, mean scores would not 
help to reveal trends in overall achievement or in achievement gaps if the test itself has been changed. 
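
The second scenario above can be made concrete with a small sketch. The scores, the cut score of 300, and the function name are all invented for illustration; they do not come from any state's data.

```python
from statistics import mean

CUT_SCORE = 300  # hypothetical proficiency cut score


def percent_proficient(scores, cut_score=CUT_SCORE):
    """Percentage of students scoring at or above the cut score."""
    at_or_above = sum(1 for s in scores if s >= cut_score)
    return 100 * at_or_above / len(scores)


# Invented scale scores for ten students in two years.
scores_2005 = [220, 240, 250, 260, 270, 280, 310, 320, 330, 340]
# Each student below the cut gains 15 points but stays below 300;
# the four students above the cut are unchanged.
scores_2006 = [235, 255, 265, 275, 285, 295, 310, 320, 330, 340]

for year, scores in (("2005", scores_2005), ("2006", scores_2006)):
    print(year, percent_proficient(scores), mean(scores))
```

In this made-up case the percentage proficient is 40% in both years, while the mean scale score rises, so only the mean registers the improvement.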

The standard deviation is a measure of how spread out or bunched together test scores are 
within a data set. It is a way to measure the distance of all scores from the mean score. This 
statistic gives more information about the entire distribution of test scores than the percentage 
proficient does. A standard deviation can be calculated for any batch of test scores. If test 
scores are bunched close together (meaning all students score close to the mean), then the 
standard deviation will be small. Conversely, if test scores are spread out (meaning that many 
students score far from the mean), then the standard deviation will be large. Box C provides 
further explanation of standard deviations. 
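
A brief sketch shows the contrast between bunched and spread-out scores. The two score lists are invented, and the use of the population standard deviation (`pstdev`) is an assumption, since the text does not say which form of the statistic states reported.

```python
from statistics import mean, pstdev

# Two invented sets of scores with the same mean of 500.
bunched = [495, 498, 500, 502, 505]
spread = [400, 450, 500, 550, 600]

for label, scores in (("bunched", bunched), ("spread", spread)):
    # pstdev is the population standard deviation.
    print(label, mean(scores), round(pstdev(scores), 1))
```

Both sets have the same mean, but the standard deviation of the spread-out set is many times larger than that of the bunched set.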




ANALYZING CHANGES IN EFFECT SIZES 

Using means and standard deviations, we were able to compute an effect size statistic called 
Cohen’s d (Cohen, 1988; Willingham & Cole, 1997). Effect sizes provide a standard index 
to gauge achievement trends and gap trends; simply put, they are a measure of growth 
compared to a standard deviation. 

An effect size is computed by subtracting the year 1 mean test score from the year 2 mean 
score and dividing by the average standard deviation of the two years. Where there has been 
no change in the average score, the effect size is 0. An effect size of +1 indicates a shift 
upward of 1 standard deviation from the previous year’s mean test score (in practice, effect 
sizes tend to be much smaller than 1 for mean changes from year to year). Even if two states 
have widely varying score scales and proficiency cut scores, the effect size statistic describes 
annual changes in the mean in terms of the tests’ standard deviations. 
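
The computation described above can be sketched as follows. The function name and the example values are hypothetical. The denominator is the simple average of the two years' standard deviations, as described in the text, which differs from some textbook forms of Cohen's d that pool the variances.

```python
def effect_size(mean_year1, sd_year1, mean_year2, sd_year2):
    """Change in the mean score, expressed in standard deviation units.

    Follows the description in the text: (year 2 mean - year 1 mean)
    divided by the simple average of the two years' standard deviations.
    """
    average_sd = (sd_year1 + sd_year2) / 2
    if average_sd <= 0:
        raise ValueError("standard deviations must be positive")
    return (mean_year2 - mean_year1) / average_sd


# Two hypothetical states whose tests use very different score scales.
state_a = effect_size(mean_year1=500, sd_year1=80, mean_year2=508, sd_year2=80)
state_b = effect_size(mean_year1=50, sd_year1=8, mean_year2=50.8, sd_year2=8)
print(round(state_a, 2), round(state_b, 2))
```

Although the raw gains differ by a factor of ten (8 points versus 0.8 points), both hypothetical states show a gain of about one-tenth of a standard deviation.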






Effect size results are a little more difficult for many readers to interpret than the percentage 
proficient. What does it mean, for example, that the reading score of Delaware 4th graders 
went up by a total of 0.19 of one standard deviation between 2002 and 2006? Is it a big 
improvement or a small one? It is still somewhat of a subjective figure, but broad 
comparisons are possible. Two Harvard University researchers, in an earlier review of NCLB, noted 
that between 1967 and 1982, scores of U.S. students on the SAT college admissions test fell 
by 0.3 standard deviations. Between 1970 and 1982, high school science scores on the 
National Assessment of Educational Progress fell by 0.4 standard deviations. These trends 
were considered alarming at the time, and were among the factors that gave rise to the 
accountability movement in education (Peterson & West, 2003, pp. 3-5). 


Box C. What Are Standard Deviations? 

Bell-shaped curves are used to graphically represent the distribution of scores on any 
administration of a test. The largest numbers of test-takers’ scores cluster close to the middle or high 
point of the curve, while fewer scores are situated at the low and high extremes. 

Three areas on a standard normal curve are useful for interpreting test scores. The first is at point 0, which 
is the mean, or average, test score. Fifty percent of the scores are below the mean and 50% are above. 

The second area is within 1 standard deviation of the mean (from -1 to +1); about 68% of the scores on a 
given test fall within this area. The band from the mean to one standard deviation above the mean captures 
about 34% of scores (half of 68%). 

The third area of interest is between -2 and +2 standard deviations and accounts for about 95% of the 
scores on a given test. 

Let’s say that a test is scored on a scale from 1 to 1000 and that the mean score is 500 and the standard 
deviation is 80. This means that the scores of 68% of test-takers are between 420 (500 - 80) and 580 
(500 + 80). Similarly, 95% of the scores would fall between 340 (500 - 160) and 660 (500 + 160). 

Since the percentage of test-takers who score within one or two standard deviations of the mean is about 
the same for any test whose scores follow a normal curve, the standard deviation is a common unit of 
measurement that can be used to make limited comparisons of groups of test-takers. In this study, effect 
size is used, which expresses the difference between two years of test data or two subgroups of students 
in standard deviation units. 

Source: Stockburger (1996), and Center on Education Policy. 
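
The worked example in box C (mean 500, standard deviation 80) can be checked numerically. This is only a sketch that assumes scores follow a normal curve exactly; the helper name `band` is made up for the example. Two standard deviations is 160 points, so the two-standard-deviation band runs from 340 to 660.

```python
from statistics import NormalDist

scores = NormalDist(mu=500, sigma=80)  # the example test from box C


def band(dist, k):
    """Score range within k standard deviations of the mean, and the
    share of scores inside that range under a normal curve."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return low, high, dist.cdf(high) - dist.cdf(low)


for k in (1, 2):
    low, high, share = band(scores, k)
    print(f"within {k} SD: {low:.0f} to {high:.0f}, about {share:.0%} of scores")
# within 1 SD: 420 to 580, about 68% of scores
# within 2 SD: 340 to 660, about 95% of scores
```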

Let’s say a hypothetical state showed an improvement of +1.0 standard deviations between 
2002 and 2006. This would constitute a huge leap in student performance. Assuming a 
normal curve, as shown in box C, that gain would be the equivalent of an increase from 50% 
of students performing at the proficient level or above in 2002 to 84% in 2006. 
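
The 50%-to-84% equivalence can be reproduced under the same normal-curve assumption. The sketch below is illustrative only: the function name is hypothetical, it assumes the spread of scores and the cut score stay fixed, and the second call simply applies a gain of 0.19 to a starting point of 50% rather than to any state's actual percentage proficient.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def percent_proficient_after_gain(percent_before, effect_size):
    """Percent at or above a fixed cut score after the mean shifts.

    Assumes normally distributed scores whose spread does not change.
    `percent_before` must be strictly between 0 and 100.
    """
    # Position of the cut score relative to the starting mean, in SD units.
    cut_z = STANDARD_NORMAL.inv_cdf(1 - percent_before / 100)
    # After the mean rises by `effect_size` SDs, the cut sits that much lower.
    return 100 * (1 - STANDARD_NORMAL.cdf(cut_z - effect_size))


print(round(percent_proficient_after_gain(50, 1.0)))   # 84
print(round(percent_proficient_after_gain(50, 0.19)))  # 58
```

The same effect size translates into a different percentage point change when the starting percentage is far from 50%, which is one reason the two measures can diverge.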

Because effect sizes are based on mean test scores, they capture changes in student 
achievement above and below the proficient level. This helped us address one of the limitations 
of the percentage proficient measure. Effect sizes also have an advantage over mean scale 
scores alone in that they provide a standard index for gauging achievement and gap trends 
across states. Since state tests use different scoring scales (such as 1-100 or 1-500), it is dif- 
ficult to interpret changes in mean test scores from one state to another. Effect sizes allow 
researchers to make some limited comparisons between states. 

Effect sizes do have limitations. They do not enable comparisons within the same state if 
that state had a change in its testing program. To compare statistics within the same state, 
the test data for various years must be expressed on the same continuous scale because the 
computation involves subtracting one year’s mean from another year’s mean. If, for instance, 
a state used one scale for tests in 2001 and 2002, and then changed the scale for 2003 and 
2004, one could compute an effect size for the differences in means from 2001 to 2002 and 
from 2003 to 2004. But one could not compute a comparable effect size from 2002 to 
2003. Thus, one cannot use effect sizes to examine achievement trends unless a state has 
maintained a continuous scale that allows for these comparisons. 
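
The 2001-2004 example above can be sketched in code. The data layout, the scale labels, and all the numbers are hypothetical; the point is only that a year-to-year effect size is computed when both years share a scale and skipped otherwise.

```python
def yearly_effect_sizes(results):
    """Effect sizes for consecutive years that share a score scale.

    `results` maps year -> (scale_id, mean, sd); `scale_id` is a
    hypothetical label marking which score scale was in use.
    """
    sizes = {}
    years = sorted(results)
    for prev, curr in zip(years, years[1:]):
        scale_prev, mean_prev, sd_prev = results[prev]
        scale_curr, mean_curr, sd_curr = results[curr]
        if scale_prev != scale_curr:
            continue  # different scales: the means cannot be subtracted
        sizes[(prev, curr)] = (mean_curr - mean_prev) / ((sd_prev + sd_curr) / 2)
    return sizes


results = {
    2001: ("old", 250.0, 40.0),
    2002: ("old", 254.0, 40.0),
    2003: ("new", 500.0, 80.0),
    2004: ("new", 512.0, 80.0),
}
print(yearly_effect_sizes(results))
```

The output contains entries for 2001 to 2002 and for 2003 to 2004 but none for 2002 to 2003, where the scale changed.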

Effect sizes also do not take into account the relative difficulty of tests and standards from 
state to state. They only allow one to compare improvement or decline on each state’s respective 
tests, such as comparing how much students in Minnesota have improved on Minnesota’s 
state test with how much students in Vermont have improved on Vermont’s state test. One 
still doesn’t know which state’s test is more difficult, or which state’s students are achieving 
more. Furthermore, effect size is a measure of relative growth, not absolute performance. For 
these reasons, we do not compare or rank states based on effect size results. Instead, we use 
the effect size as an additional source of information to examine achievement changes within 
each state, and then we summarize changes in each state to arrive at a national picture of stu- 
dent achievement since 2002. 

We make no claim that effect size data are going to provide us with a perfect picture of student 
achievement since the inception of NCLB. Effect sizes are just another piece of information 
to supplement the percentage proficient measure, so we are drawing conclusions based on 
two sources of information about achievement rather than just one. 




FLAGGING SMALL AND CHANGING SUBGROUPS 

To address the problem of changing composition of some subgroups, we collected the number 
of students tested (often referred to as N-counts) in reading and math at each tested grade 
and for each subgroup tracked under NCLB. In our analysis of achievement gaps, the N-counts 
were used to consider whether a subgroup of students was large and stable enough to draw 
valid inferences about achievement trends. 

First, we flagged subgroups that were small with a footnote, because small groups are especially 
susceptible to year-to-year fluctuations in test scores that do not reflect actual changes in student 
achievement (see CEP, 2002 for a fuller discussion). We defined small as less than 5% of the 
total state student population tested at a grade, or fewer than 500 students. 

Second, we flagged with a footnote any subgroups that had changed substantially in size 
during the period studied. When the size of a subgroup increases or decreases very rapidly 
in the course of a few years, this complicates trend analyses. Changes in test results may be 
due to changes in the composition of the subgroup as well as changes in achievement. We 
defined rapidly changing subgroups as those that changed by at least 25% up or down during 
the years analyzed. For all flagged subgroups, we have reported test results in the tables in 
the state profiles, along with specific cautions about interpreting them. 
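
The two flagging rules can be expressed as a short sketch. The thresholds (5%, 500 students, 25%) come from the text; the function name, the data layout, and two details are assumptions: that the size test is applied in every year, and that the change in size is measured between the first and last years analyzed.

```python
def flag_subgroup(n_counts, total_tested):
    """Return caution flags for one subgroup at one grade.

    `n_counts` maps year -> number of subgroup students tested;
    `total_tested` maps year -> all students tested at that grade.
    """
    flags = []
    years = sorted(n_counts)
    # Small: under 5% of students tested, or fewer than 500 students.
    # Checking every year is an assumption; the text does not specify.
    if any(n_counts[y] < 500 or n_counts[y] < 0.05 * total_tested[y] for y in years):
        flags.append("small")
    # Rapidly changing: size moved by at least 25% up or down.
    # Comparing the first and last years is also an assumption.
    first, last = n_counts[years[0]], n_counts[years[-1]]
    if first > 0 and abs(last - first) / first >= 0.25:
        flags.append("changing")
    return flags


# Hypothetical subgroup that is under 5% of test-takers in 2002 and grows 40%.
print(flag_subgroup({2002: 3000, 2006: 4200}, {2002: 70000, 2006: 72000}))
# Hypothetical subgroup that is large and stable.
print(flag_subgroup({2002: 9000, 2006: 9500}, {2002: 70000, 2006: 72000}))
```

The first call returns both flags and the second returns none.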

Third, we addressed the problem of changes in policies for students with disabilities and 
LEP students by deciding not to present trends for these two groups in the national summary 
of achievement gaps in chapter 5. There was no way to arrive at valid and reliable trends in 
achievement for these two subgroups for the reasons described in box A. We also noted in 
our analysis that trends for these two subgroups should be interpreted with caution. We did 
record test results and some trends for these two subgroups in the online state profiles, with 
reminders about the need for caution when drawing conclusions. 



Problems with Tests That We Did Not Address 

Even in states that could provide data on percentages proficient, means, standard deviations, 
and numbers of students tested, one must still be cautious when interpreting test score 
trends. There is a certain degree of “fuzziness” and distortion in state test results that is 
derived from the tests themselves and the way they are created, administered, and scored. 
For instance, student performance may be affected by changes in the specific test questions 
that appear on each year’s version of the test and by scoring procedures and cut scores. A 
state can adjust its test from year to year in ways that can affect the validity of inferences 
about changes in test scores over time. 



In addition to the major changes noted above that create obvious breaks in data, test results 
can be affected by less explicit or unintentional changes. There can still be subtle manipulation 
of tests through a series of small decisions made by test administrators — tinkering rather 
than momentous changes. Following are some test construction issues and decisions that can 
affect the comparability of test scores from year to year: 

• Test equating. To guard against breaches of test security, such as teachers and students 
memorizing test questions and answers, states use different forms of a test each year, 
composed of partially or entirely different test questions. Test developers try to make 
each test form similar in terms of the general content covered and level of difficulty. In 
addition, they often use a statistical technique called equating to make it possible to com- 
pare multiple forms of the same test with each other. Technical factors or the use of an 
incorrect equating methodology can introduce various types of errors into the equating 
process, producing forms that are not truly comparable (Kolen, 1988). A member of 
our expert panel has observed that typical equating procedures used by states can cause 
annual fluctuations in the percentage proficient of plus or minus 2% or more (Linn, n.d.). 

• Weighting of test questions. Many state tests use a combination of multiple-choice, 
short-answer, and essay questions. Usually the questions that require students to write 
out their responses are worth more points on a test than multiple-choice questions; that 
is, they are more highly weighted. If the relative proportion and weighting of different 
types of test questions changes from year to year on a state’s test, this can affect the 
comparability of scores. 

• Changes to scoring procedures. Short-answer and essay questions must be scored by 
hand by trained scorers. If the scoring guidelines (called rubrics) or training procedures 
change even slightly from year to year, this can affect the comparability of test results. 

• Re-use of test questions. For cost effectiveness and equating purposes, states often re-use 
some test questions across years. When entire test forms or large numbers of items are 
used repeatedly, students and teachers tend to become familiar with the questions and 
memorize the answers. While test scores may go up, the trend can be misleading if 
students have simply learned the answers to particular test questions but have not truly 
learned more about the larger subject being assessed. 

These are just some examples of factors that can affect the comparability of test results from 
year to year in the same state. Even when accurate and complete test data are obtained, more 
subtle changes in state testing systems of the type described above can affect results. In an 
atmosphere of intense pressure to demonstrate achievement gains, administrators might err 
on the side of leniency when making these types of decisions. Based on test information that 
states make publicly available, it is often difficult to tell whether or how much any of the 
factors mentioned above actually distort the picture of student achievement derived from 
test scores in a state. These issues will be further explored during in-depth state interviews 
in phase II of this study. 







Rules for Analyzing Data 




To analyze the state achievement data collected during phase I of this study, we took pains 
to develop consistent rules for analysis that would weed out incompatible data; identify 
trends that were consistent enough across grades and years to indicate a clear pattern of 
improvement; avoid “cherry picking” years, grades, or subgroups with the best or worst 
performance; and treat all states similarly and fairly. With extensive input from the expert 
panel, CEP and HumRRO arrived at the following rules for reporting and analyzing data: 

• Grades analyzed. We looked separately at the elementary, middle, and high school 
levels for all of the achievement analyses. In states that tested multiple grades within 
each of these spans, we made decisions about which specific grades to report on and 
analyze based on a fixed set of criteria that were applied consistently across all states and 
developed without regard to whether achievement was high or low at a given grade. 
Generally, the grades selected were those with the longest trend lines. For analyses of 
effect sizes and achievement gaps, we selected one representative grade at the elementary, 
middle, and high school levels from among the grades included in the overall percentage 
proficient analysis. The first choices for these analyses were grades 4, 8, and 10, but if 
trend data were not available for these grades in a specific state, an adjacent grade was 
used in a fixed order. The detailed criteria for selecting grades are explained in the appendix. 

• Years analyzed. If the state introduced a new test in 2005 or earlier, the analysis used 
that test and ended in 2005. If the state introduced a new test in 2006, the prior test was 
used in our analysis. Because many states introduced tests at different times in different 
grades, the years covered by our analyses sometimes varied at the elementary, middle, or 
high school level. For example, the analysis in a state might span 1999-2004 at the 
elementary and middle school levels but cover 2005-2006 at the high school level. 

• Separate analyses for reading and math. Changes in achievement were analyzed separately 
for reading and math, since student performance in these subjects is often very different. 

• Trend determinations. Differences involving just two years of data were referred to as 
“changes” in achievement rather than trends, since two years are too short a period to 
discern whether a change is an actual trend or simply a normal year-to-year fluctuation in test 
results. On a similar note, we based our determinations of achievement trends on a broad 
pattern across multiple years, disregarding the kinds of small year-to-year fluctuations that 
typically occur in test results. For our findings about achievement gaps, we considered an 
increase or decrease in the gap for a specific subgroup to be a trend if it occurred in the same 
subject across all three grade spans analyzed (elementary, middle, and high school). 

• Emphasis on average yearly gains. To even out the normal year-to-year fluctuations 
that occur with any test, we averaged gains or declines in test results across a period of 
years and focused on these average yearly gains in our analyses. 

• “Slight” increases or decreases. We characterized an average change in achievement of 
less than 1 percentage point per year as a “slight” increase or decrease. This is because test 
scores are not perfect and include some measurement error resulting from factors unrelated 
to student learning, such as those listed earlier in this chapter in the discussion of standard 
error. “Slight” increases or decreases should be interpreted with caution because they may 
reflect measurement error rather than real changes in student achievement. 



• Subgroups analyzed. The subgroups included in the achievement gap analyses were 
those tracked for accountability purposes in the NCLB law: major racial/ethnic groups 
in the state, low-income students, students with disabilities, and limited-English-proficient 
students. Within the state tables, we used the labels for these subgroups used by that 
particular state, so the subgroup names vary among states. When reporting subgroup 
trends, we did not mention subgroups that performed roughly the same as or higher than 
the comparison group of white students, as the Asian subgroup did in most states. 

• Subgroup comparisons. For subgroups other than racial/ethnic groups, we compared 
the achievement of the subgroup of interest with the universe of other students who were 
not in that subgroup, whenever possible. For example, we compared low-income students 
with students who were not low-income when these comparison data were available. 
When test results for the comparison group were unavailable, we compared the group of 
interest to the state as a whole — for example, we compared low-income students with all 
students in the state. Although this latter approach is not the optimum one, it was the 
best option available in some states. 

• Small or changing subgroups. As noted above, we flagged results for subgroups that 
were small or had changed significantly in size and included notes about interpreting 
results for these subgroups with caution. 

• Special caution for students with disabilities and LEP students. As explained above, 
we avoided reaching national conclusions about these subgroups and cautioned people 
not to put too much stock in apparent trends for these subgroups. 

Detailed information about other methods we used can be found in the appendix. 
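
Several of the rules above lend themselves to a short sketch. The function names and example figures are hypothetical. Computing the average yearly gain as the change between the first and last years divided by the number of years elapsed is an assumption, since the exact formula is not given here. The one-point threshold for "slight" and the three-span requirement for gap trends come from the rules above.

```python
def average_yearly_gain(percent_by_year):
    """Average change per year between the first and last year with data."""
    years = sorted(percent_by_year)
    if len(years) < 2:
        raise ValueError("need at least two years of data")
    span = years[-1] - years[0]
    return (percent_by_year[years[-1]] - percent_by_year[years[0]]) / span


def describe_change(gain_per_year):
    """Label a change, marking anything under 1 point per year as slight."""
    if gain_per_year == 0:
        return "no change"
    direction = "increase" if gain_per_year > 0 else "decrease"
    return f"slight {direction}" if abs(gain_per_year) < 1 else direction


def gap_trend(gap_change_by_span):
    """A gap change counts as a trend only if all three grade spans agree."""
    changes = [gap_change_by_span[s] for s in ("elementary", "middle", "high")]
    if all(c < 0 for c in changes):
        return "narrowing"
    if all(c > 0 for c in changes):
        return "widening"
    return "no consistent trend"


# Invented percentages proficient for one grade in one state.
gain = average_yearly_gain({2002: 60.0, 2003: 63.0, 2004: 62.0, 2005: 66.0, 2006: 68.0})
print(gain, describe_change(gain))
print(describe_change(average_yearly_gain({2002: 70.0, 2006: 72.0})))
print(gap_trend({"elementary": -3.0, "middle": 1.0, "high": -2.0}))
```

Averaging over the whole period smooths out the dip in 2004 in the first example, which is the purpose of the emphasis on average yearly gains.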




Individual State Profiles on the Web 

The findings in this report are based on state-by-state analyses of achievement data. These 
state analyses, along with the detailed data tables and figures on which they are based, have 
been packaged into profiles for every state. Individual state profiles can be viewed and down- 
loaded from the CEP Web site at www.cep-dc.org/pubs/stateassessment. We encourage all 
readers who are interested in trends for a specific state to visit the Web site and look at the 
profile for that state. Box D lists the information contained in the state profiles. 






Box D. Contents of Profiles for Individual States 

The state profiles available on the CEP Web site (www.cep-dc.org/pubs/stateassessment) contain the 
following descriptive information: 

• Test characteristics. A list of the key characteristics of the reading and math tests used in the state for 
NCLB accountability, including the test name, grades tested, time of year when the test is administered, 
first year the test was administered, and major changes in the testing system since 2002. 

• Summary of findings. The most important trends emerging from our analyses of state achievement data. 

• Achievement trends. Findings from our analyses of overall trends in student achievement based on 
percentages proficient and effect sizes where available. 

• Gap trends. Findings from our analyses of trends in achievement gaps based on percentages 
proficient and effect sizes where available. 

Each profile also contains the following data figures and tables, based on the data available by the 
deadline for phase I of the study: 

• Overall percentages proficient. Figures and tables for reading and math showing the percentages of 
students scoring at or above the proficient level for various grades at the elementary, middle, and 
high school levels. These figures and tables cover all of the years from 1999 through 2006 for which 
comparable data were available. The tables also show the average yearly gains or declines in 
percentages proficient before and after 2002, when NCLB took effect. 

• Overall effect size data (where available). Figures and tables for reading and math displaying mean 
test scores, standard deviations, and effect sizes for one grade at each of three grade spans 
(elementary, middle, and high school). These figures and tables cover all of the years from 1999 
through 2006 for which comparable data were available. The tables also show the average yearly 
gains or declines in effect size before and after NCLB. 

• Gaps in percentages proficient. Tables for reading and math showing percentages proficient by 
student subgroup at three different grade levels for 2002 and 2006 (or for whichever adjacent years 
comparable data were available). Subgroups displayed on these tables include all the major 
racial/ethnic subgroups in the state, plus low-income students, students with disabilities, and 
limited-English-proficient students. These tables also show the percentage point gaps between 
various subgroups at the selected grades, changes in achievement gaps during the period analyzed, 
and average yearly gains or declines in gaps. 

• Gaps by effect size (where available). Tables for reading and math showing gaps by effect size for 
subgroups of students for 2002 and 2006 (or for whichever adjacent years comparable data were 
available). Effect size data are included for three different grade levels for the subgroups of students 
listed above for which data are available. These tables also indicate changes in effect size gaps over 
the years analyzed and average yearly gains or declines in the effect size gap. 

• Supplemental tables. Additional tables intended primarily for researchers. These include overall 
percentages proficient in reading and math converted to z-scores (defined in the glossary at the end 
of this report); gaps in percentages proficient converted to z-scores; and, where available, data on 
the number of test-takers in each subgroup for the period analyzed. 
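
For readers unfamiliar with the conversion mentioned under supplemental tables, one common way to turn a percentage proficient into a z-score is the inverse of the standard normal curve. The sketch below shows that general technique only; it is an assumption that the supplemental tables use this exact definition (the report's glossary gives the definition actually used), and the function names are hypothetical.

```python
from statistics import NormalDist


def percent_to_z(percent_proficient):
    """z-score whose lower-tail area equals the proportion proficient."""
    p = percent_proficient / 100
    if not 0 < p < 1:
        raise ValueError("percentage must be strictly between 0 and 100")
    return NormalDist().inv_cdf(p)


def z_gap(percent_comparison, percent_subgroup):
    """Gap between two groups expressed in z-score units."""
    return percent_to_z(percent_comparison) - percent_to_z(percent_subgroup)


# Invented example: 84% proficient in one group and 50% in another.
print(round(z_gap(84, 50), 2))
```

Under the normal-curve assumption, a gap between 84% and 50% proficient corresponds to roughly one standard deviation.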





Chapter 4 

Trends in Overall Achievement 



Key Findings 

• The weight of evidence indicates that state test scores in reading and mathematics have 
increased overall since NCLB was enacted. All of our analyses — including percentages of 
students scoring proficient, effect sizes (a measure based on average, or mean, test scores), 
and pre- and post-NCLB trends — found substantially more states with gains in student 
test results than with declines since 2002. 






• Regardless of whether one analyzes percentages proficient or effect sizes, the number of 
states showing achievement gains since 2002 is far greater than the number showing 
declines. (The subset of states with sufficient data varies, depending on the particular 
analysis.) For example, of the 24 states with both percentage proficient and effect size 
data for middle school reading, 11 states demonstrated moderate-to-large gains in this 
subject and grade span, while only one state exhibited a moderate or larger decline. Using 
percentage proficient data alone, 20 of the 39 states with this type of data showed mod- 
erate-to-large gains in middle school reading, while only one state showed a moderate or 
larger decline. 

• Of the 22 states with both percentage proficient and effect size data, 5 made moderate- 
to-large gains in reading and math across all grade spans (elementary, middle, and high 
school) according to both measures. In other words, these five states showed gains 
according to all of the indicators collected for this study, allowing one to conclude with 
some confidence that achievement has gone up in those states. In reading, seven states 
showed moderate-to-large increases across all grades analyzed, according to both the per- 
centage proficient and effect size measures. In math, nine states showed similar gains 
across all grades analyzed on both measures. (Both of these groups include 
the five states that made gains in both subjects.) The rest of the states had 
different trends at different grade spans. 

• Elementary-level math is the area in which the most states showed improvements. Of the 
25 states with both percentage proficient and effect size data in elementary math, 22 
demonstrated moderate-to-large math gains at the elementary level on both measures, 
while none showed moderate or larger declines. Based on percentages proficient alone, 
37 of the 41 states with trend data in elementary math demonstrated moderate-to-large 
math gains, while none showed declines of that magnitude. 

• More states showed declines in reading and math achievement at the high school level 
than at the elementary or middle school levels. Still, the number of states with test score 
gains in high school exceeded the number with declines. 








• Since many states had reform efforts well underway before NCLB, it is useful to know 
whether the pace of improvement has picked up or slowed down since NCLB took effect 
in 2002. Only 13 states supplied enough years of data for us to make this determination. 
In nine of these states, test results improved at a greater yearly rate after 2002 than before. 
In the other four states, the pre-NCLB rate of average yearly gain outstripped the post- 
NCLB rate. 

• Analyzing changes in achievement using effect sizes generally produced the same findings 
as analyzing achievement using percentages proficient. But in some cases, the effect size 
analysis showed a different trend. For instance, in Nevada the percentage proficient in 
high school math decreased while the average test score increased. Conversely, in New 
Jersey the percentage proficient in middle school reading increased slightly, while the aver- 
age test score dropped. 




How We Analyzed Overall Test Score Trends 

Improving the academic achievement of all public elementary and secondary school stu- 
dents is a primary goal of the No Child Left Behind Act, along with closing achievement 
gaps. This chapter describes our findings about trends in overall achievement. We looked at 
two measures of achievement, where available: 

• The percentages of students scoring at or above the proficient level on state tests — 
the primary measure of adequate yearly progress under NCLB 

• Effect sizes, which are based on mean test scores and standard deviations and give a more 
complete picture of the entire distribution of student performance 

We focused mainly on achievement trends since NCLB took effect in 2002. In states with 
available data, we also compared trends before and after NCLB. 

To arrive at the findings in this chapter, we produced very detailed data profiles for each of 
the 50 states, consisting of up to four figures and 13 tables per state and narrative 
descriptions of that state’s achievement trends. The figures and tables included a host of data on 
percentages proficient, mean scores, effect sizes, and other information described in chapter 3. 
(The state profiles can be found online at www.cep-dc.org/pubs/stateassessment/.) Using the 
data in the profiles, we closely analyzed achievement trends in reading and math within indi- 
vidual states, looking grade by grade across all the years. This was an enormous undertaking 
due to the amount of data involved. We then coded and compiled our findings from the 50 
states to produce the tables in this chapter and develop a national picture of achievement 
trends. The appendix provides more detailed information about study methods. 

During phase I of our study, we could not obtain from every state all of the data necessary 
to do pre- and post-NCLB analyses of percentages proficient and effect sizes. Therefore, the 
total number of states with sufficient data is different for each type of analysis: 

• In 30 states, we obtained both percentages proficient and effect size data for at least some of 
the years between 2002 and 2006. In general, these are the states in which we have the most 
confidence about post-NCLB trends because we could use the two types of analyses as cross- 
checks. In 8 of these 30 states, data were missing for a particular grade level or did not cover 
enough years at a grade level to constitute a three-year trend. The number of states with suf- 
ficient data — as well as the specific subset of states — differs by grade span and subject. 



• In all 50 states, we obtained percentages proficient for at least some of the years between 
2002 and 2006. This group includes the 30 states described above, plus 20 states that 
did not make available effect size data by the phase I deadline but did have percentages 
proficient. In 16 of the 50 states, data were missing for a particular grade level or did not 
cover enough years at a grade level to constitute a three-year trend. 

• Only 13 states provided sufficient achievement data for the years before 2002 to enable 
us to compare achievement trends before and after NCLB took effect. 

When data were available for fewer than three years, this was mainly due to breaks in
comparability caused by the introduction of new tests or changes in existing testing systems,
such as changes in cut scores or content standards. Often these changes were made partly 
in response to the testing and accountability requirements of NCLB. Chapter 7 explains in 
detail which states provided various types of data and why data are limited or have breaks 
in comparability. 



Findings about Overall Test Score Trends Since 2002 

Below we describe our findings about broad trends in achievement since 2002. We also 
spotlight trends by subject in reading and math and offer possible explanations for the 
trends we found. 




TRENDS BASED ON PERCENTAGES PROFICIENT AND EFFECT SIZES 

In states with both percentages proficient and effect size data, we used both of these meas- 
ures to analyze achievement trends. 



Trends Across Three Grade Spans 

Table 1 displays the number of states with achievement gains in reading and mathematics 
at all three grade spans (elementary, middle, and high school) for the states with both 
percentage proficient and effect size data. Some states showed consistent gains across grade 
spans in both subjects while other states made consistent gains in just one subject. The rows
of the table indicate the size of the gains. (As noted in chapter 3, we classified an average 
annual gain or decline in the percentage proficient as moderate-to-large if it equaled one per- 
centage point or more and as slight if it equaled less than one percentage point.) None of 
the states with both percentage proficient and effect size data showed declines across all grade 
spans, so there are no rows to indicate them. There is also a row for states with mixed positive 
results, referring to a mixture of slight gains and moderate-to-large gains. For example, a state in 
this row may have had moderate-to-large gains at the elementary and middle levels, but only a 
slight gain in high school; or the state might have shown gains in two grade spans according to 
both percentages proficient and effect sizes but in the remaining grade span according to percent- 
ages proficient only. 
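The classification rule described above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the study's own procedure; the function name and the "no change" label for a change of exactly zero are assumptions introduced here.

```python
def classify_trend(avg_yearly_change: float) -> str:
    """Label an average yearly change in the percentage proficient.

    Threshold from the text: a change of one percentage point or more per
    year is moderate-to-large; less than one point is slight.
    The "no change" label for exactly zero is an assumption of this sketch.
    """
    if avg_yearly_change == 0:
        return "no change"
    size = "moderate-to-large" if abs(avg_yearly_change) >= 1.0 else "slight"
    direction = "gain" if avg_yearly_change > 0 else "decline"
    return f"{size} {direction}"


for change in (2.4, 0.4, -2.8):
    print(change, "->", classify_trend(change))
```

Under this rule, a state gaining exactly 1.0 percentage point per year falls in the moderate-to-large category.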







Table 1. Number of States with Various Test Score Trends
Since 2002 Across All Three Grade Spans

(Includes states with both percentage proficient and effect size data)

Type of Trend                               Both Reading & Math   Reading           Math
                                            All Grade Spans       All Grade Spans   All Grade Spans

Moderate-to-large gains in both             5                     7                 9
  percentage proficient and effect size
Slight gains¹ in both percentage            0                     1                 0
  proficient and effect size
Mixed slight and moderate-to-large gains    4                     4                 6
  in percentage proficient and/or
  effect size
Number of states with sufficient trend      22                    22                22
  data for this analysis²

Table reads: Since NCLB was enacted in 2002, five states have made moderate-to-large gains in both the percentages
of students scoring proficient and in effect sizes in reading and math at all three grade spans analyzed (elementary,
middle, and high school). Seven states have made gains at all three grade spans in reading, and nine made gains at
all three grade spans in math.

¹ A “slight gain” means an average yearly gain of less than 1.0 percentage point.

² States with sufficient data have comparable test data for at least three of the years between 2002 and 2006.



Some notable findings from table 1:

• Of the 22 states with sufficient trend data for three grade spans in reading and math, five 
states — Delaware, Kansas, Kentucky, Louisiana, and Washington — demonstrated gains 
in both subjects and all three grade spans based on both the percentage proficient and 
effect size measures. In other words, these five states showed moderate-to-large gains 
according to all of the indicators collected for this study. Another four states demon- 
strated gains that varied in magnitude or by measure. 

• Seven of the 22 states with sufficient data in reading showed moderate-to-large gains in 
this subject at all three grade spans, according to both the percentages proficient and 
effect size measures; these states include the five listed above plus Tennessee and Idaho. 
Nine of the 23 states with sufficient data in math made moderate-to-large gains in this 
subject at all three grade spans on both measures; these states include the five listed above 
plus Mississippi, New Jersey, Utah, and West Virginia. 

• No state showed declines of any magnitude across all grade spans in either subject. 





Trends by Subject and Grade Span 

Table 2 summarizes trends in test scores separately by grade span and subject for the 30 
states with both percentage proficient and effect size data. The specific number of states with 
sufficient data to analyze trends in each subject and grade span varied. The rows of the table 
display different types of achievement trends based on one or both measures, ranging from 
gains to declines. The top row of the table, for example, displays the numbers of states 
demonstrating moderate-to-large gains in both percentages proficient and effect sizes. The 
second row shows the number of states with gains in percentages proficient but a different 
trend in effect sizes, either a flat trend or a decline. 



Table 2. Number of States with Various Test Score Trends Since 2002
by Grade Span and Subject

(Includes states with both percentage proficient and effect size data)

Type of Trend                               Reading     Reading   Reading   Math        Math      Math
                                            Elementary  Middle    High      Elementary  Middle    High

Moderate-to-large gain in both              14          11        10        22          17        12
  percentage proficient and effect size
Moderate-to-large gain in percentage        1           0         1         0           0         1
  proficient only, contradicted by
  effect size
Slight gain¹ in both percentage             3           5         4         1           3         4
  proficient and effect size
Slight gain¹ in percentage proficient       4           3         2         0           2         0
  only, contradicted by effect size
Slight decline¹ in both percentage          2           2         1         1           2         1
  proficient and effect size
Slight decline¹ in percentage proficient    0           2         2         1           0         2
  only, contradicted by effect size
Moderate-to-large decline in both           1           1         2         0           0         1
  percentage proficient and effect size
Moderate-to-large decline in percentage     0           0         0         0           0         1
  proficient only, contradicted by
  effect size
Total number of states with sufficient      25          24        22        25          24        22
  data for trend²

Table reads: Since NCLB was enacted in 2002, 14 states have made achievement gains in reading at the elementary
level, based on both the percentage of students scoring at the proficient level and effect size.

¹ A “slight gain” or “slight decline” means an average yearly gain or decline of less than 1.0 percentage point.

² States with sufficient data have three or more years of comparable test data since 2002.







The columns show achievement trends separately for reading and math at each of the grade 
spans analyzed. For example, the column for elementary reading indicates that 14 states 
experienced moderate-to-large gains using both measures, but that 1 state had a gain in the 
percentage proficient which was not confirmed by its effect size trend. When the same 
upward trend appears across both measures (percentages proficient and effect sizes), one can 
conclude with some confidence that achievement has improved since NCLB was enacted. 
Each column also shows the total number of states with three or more years of comparable 
test data since 2002 in a particular subject and grade span. 

In table 2, as in table 1, the number of states demonstrating gains in student performance since 
2002 far exceeds the number showing declines. For instance, table 2 displays the following trends: 

• Twenty-two of the 25 states with sufficient data experienced moderate-to-large gains on 
both measures in elementary math. In general, more states showed improvements at the 
elementary level than at the middle or high school level. 

• The number of states with moderate-to-large gains far exceeded the number with slight gains. 

• Only a handful of states — no more than five in a given subject and grade span — showed 
declines of any magnitude. No state showed a decline at all grade spans in reading or 
math. More states showed declines at the high school level than at the lower grades. 



Similarity between Percentage Proficient and Effect Size Trends 

In most states with effect size data, the effect size analysis confirmed the trends in percentages 
proficient. This gives us a fair amount of confidence in the results. In some states, however, the 
findings did not converge, which suggests the importance of conducting both types of analyses. 
When the two measures diverge, this may be a signal to look carefully at the gains in percent- 
ages proficient and be cautious in drawing conclusions about overall achievement trends. 

An example of divergent trends occurred in Nevada. Table 3 shows the percentage of 
Nevada students performing at or above the proficient level in math. (Data were available in 
this state only for the years 2004 through 2006.) 



Table 3. Percentage of Nevada Students Scoring at the Proficient Level
or Above in Math

                      Reporting Year              Post-NCLB Average Yearly
Grade Level    2004      2005      2006           Percentage Point Gain¹

Grade 3        45%       51%       50%            2.7
Grade 5        50%       51%       55%            2.4
Grade 8        49%       49%       50%            0.4
Grade 10       52%       51%       47%            -2.8

Table reads: The percentage of Nevada 3rd graders who scored at or above the proficient level on the state math test
increased from 45% in 2004 to 51% in 2005, then declined slightly to 50% in 2006. The average yearly gain in the
percentage proficient at grade 3 was 2.7 percentage points after NCLB took effect (2004-2006).

¹ Averages are subject to rounding error.





School officials concerned with making adequate yearly progress under NCLB or parents 
wanting to know about the quality of Nevada schools might look at the percentages profi- 
cient in table 3 and see that students in grades 3 and 5 are making progress in math. But 
they might be worried about students in grade 10, where the percentage proficient in math 
declined by five percentage points in just two years — from 52% in 2004 to 47% in 2006. 

Are Nevada high school students doing worse in math? Maybe not. Table 4 shows student 
performance in terms of effect sizes. 

Here we see that the mean math score of Nevada 10th graders actually increased between
2004 and 2006 — from 288.6 to 293.1; in terms of effect sizes this was 8% of one standard 
deviation. This increase runs counter to the percentage proficient trend. This might have 
occurred because of improvement in mean scores among students either below or above the 
proficient level. A far more detailed analysis would be necessary to determine the exact rea- 
sons, but in any case, this example illustrates why multiple measures of performance are nec- 
essary to determine whether student achievement has increased since NCLB’s inception. 
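To see how a mean score can rise while the percentage proficient falls, consider the following small sketch. The scores and the cut score of 300 are invented for illustration and are not Nevada data. In the second year the lowest and highest scorers improve, but several students near the cut score slip just below it.

```python
from statistics import mean

CUT_SCORE = 300  # hypothetical proficiency cut score, not an actual state value


def percent_proficient(scores, cut=CUT_SCORE):
    """Percentage of scores at or above the cut score."""
    return 100 * sum(s >= cut for s in scores) / len(scores)


# Invented scale scores for ten students in two successive years.
year1 = [220, 250, 280, 300, 305, 310, 320, 330, 340, 350]
year2 = [260, 285, 295, 295, 298, 310, 330, 345, 360, 375]

print("Year 1: mean", mean(year1), "percent proficient", percent_proficient(year1))
print("Year 2: mean", mean(year2), "percent proficient", percent_proficient(year2))
```

In this made-up case the mean rises from 300.5 to 315.3 while the percentage proficient drops from 70% to 50%, because the percentage proficient registers only movement across the cut score.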



Table 4. Nevada Achievement Trends in Math in Terms of Effect Size

                                  Reporting Year                        Post-NCLB Average Yearly
Grade Level        2004            2005            2006                 Effect Size Gain¹

Grade 5
  MSS (SD)         294.7 (69.1)    300.6 (71.8)    302.0 (70.8)
  AAES             0.00            0.08            0.10                 0.05
Grade 8
  MSS (SD)         291.7 (97.0)    291.9 (98.2)    295.9 (97.8)
  AAES             0.00            0.00            0.04                 0.02
Grade 10
  MSS (SD)         288.6 (58.5)    289.5 (57.3)    293.1 (57.7)
  AAES             0.00            0.02            0.08                 0.04

Table reads: The mean scale score (MSS) of Nevada 5th graders on the state math test increased from 294.7 in
2004, to 300.6 in 2005, to 302.0 in 2006. The standard deviation (SD) for the mean scale score in 2004 was 69.1
(a statistic needed to calculate effect size). Using 2004 as a starting point (0.00), the accumulated annual effect
size (AAES) for grade 5 math totaled 0.10 standard deviation units by 2006. For the period after NCLB (2004-2006),
the average yearly gain in effect size for grade 5 was 0.05 standard deviation units (0.10 ÷ 2 years).

Note: Nevada’s tests used for NCLB are scored on a scale of 100-500.

¹ Averages are subject to rounding error.
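The grade 5 figures in table 4 can be approximately reproduced with a short calculation. The sketch below assumes that each year-to-year effect size is the change in mean scale score divided by the simple average of the two years' standard deviations, and that these yearly values are summed to give the accumulated annual effect size. That formula is an assumption made here because it matches the rounded values in the table; chapter 3 and the appendix describe the method actually used. The function name is hypothetical.

```python
def accumulated_effect_sizes(means, sds):
    """Accumulate year-to-year effect sizes, starting from 0.00 in the first year.

    Assumption of this sketch: each yearly effect size is the change in the
    mean divided by the average of the two adjacent years' standard deviations.
    """
    aaes = [0.0]
    for i in range(1, len(means)):
        avg_sd = (sds[i - 1] + sds[i]) / 2
        aaes.append(aaes[-1] + (means[i] - means[i - 1]) / avg_sd)
    return aaes


# Nevada grade 5 math, 2004-2006, from table 4.
means = [294.7, 300.6, 302.0]
sds = [69.1, 71.8, 70.8]

aaes = accumulated_effect_sizes(means, sds)
avg_yearly_gain = aaes[-1] / (len(means) - 1)

print([round(x, 2) for x in aaes])   # accumulated annual effect sizes by year
print(round(avg_yearly_gain, 2))     # average yearly effect size gain
```

Rounded to two decimals, this gives 0.00, 0.08, and 0.10 for the three years and an average yearly gain of 0.05, in line with the grade 5 row of the table.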







Table 5. Number of States with Various Trends in Percentages Proficient
Since 2002 Across All Three Grade Spans

Type of Trend                               Both Reading & Math   Reading           Math
                                            All Grade Spans       All Grade Spans   All Grade Spans

Moderate-to-large gains                     7                     9                 19
Slight gains¹                               0                     2                 0
Mixed slight and moderate-to-large gains    8                     9                 8
Number of states with sufficient trend      34                    34                37
  data for this analysis²

Table reads: Since NCLB was enacted in 2002, 7 states have made moderate-to-large gains in the percentage of students
scoring proficient in reading and math at all three grade spans analyzed (elementary, middle, and high school). Nine
states have made gains at all three grade spans in reading, and 19 have made gains at all three grade spans in math.

¹ A “slight gain” means an average yearly gain of less than 1.0 percentage point.

² States with sufficient data have comparable test data for at least three of the years between 2002 and 2006.



TRENDS BASED SOLELY ON PERCENTAGES PROFICIENT 

We analyzed trends in percentages proficient alone in all 50 states. 



Trends Across Three Grade Spans 

Table 5 displays the number of states with achievement gains in reading and mathematics 
at all three grade spans (elementary, middle, and high school), using percentage proficient 
data only. Table 5 includes those states from table 1 that had effect size data as well, plus 
additional states that only had percentage proficient data. 

Table 5 largely echoes the results displayed in table 1. Seven states out of the 34 with sufficient
data showed moderate-to-large gains across both subjects and all three grade spans.
Nine states demonstrated moderate-to-large gains across all grade spans in reading, and 19 
did so in math; these totals include the 7 states that made gains in both subjects. No state 
showed declines across all three grade spans in either subject. 



Trends by Subject and Grade Span 

Table 6 summarizes trends in achievement by subject and grade span for the states with per- 
centage proficient data. Although all 50 states were able to supply varying amounts of per- 
centage proficient data, the actual number with sufficient data to analyze trends across three 
years varied by subject and grade span. 





Table 6. Number of States with Various Trends in Percentages Proficient Since
2002 by Grade Span and Subject

Type of Trend                               Reading     Reading   Reading   Math        Math      Math
                                            Elementary  Middle    High      Elementary  Middle    High

Moderate-to-large gains                     29          20        16        37          32        26
Slight gains¹                               7           13        12        2           6         6
Slight declines¹                            3           5         5         2           2         6
Moderate-to-large declines                  2           1         4         0           0         2
Total number of states with sufficient      41          39        37        41          40        40
  data for trend²

Table reads: Since NCLB was enacted in 2002, 29 states have made gains in reading at the elementary level, based
on percentages of students scoring proficient.

¹ A “slight gain” or “slight decline” means an average yearly gain or decline of less than 1.0 percentage point.

² States with sufficient data have three or more years of comparable test data since 2002.




Table 6 shows the following trends: 

• Thirty-seven states demonstrated moderate-to-large gains in math at the elementary level 
out of 41 states with sufficient trend data for this subject and grade span. In general, 
more states showed gains in percentages proficient at the elementary level than at the 
middle or high school levels. 

• The number of states with moderate-to-large gains in percentages proficient far outnum- 
bered those with slight gains. 

• Declines in percentages proficient were less frequent. No more than nine states showed 
declines of any magnitude in a particular grade and subject. More states showed declines 
at the high school level than at the lower grades. 



Average Yearly Gains in Percentages Proficient 

How much progress have states made in raising their percentages proficient? Since a certain 
amount of year-to-year fluctuation in test scores is normal, it is often more meaningful to 
look at average yearly gains than to simply compare one year’s percentage proficient with 
another’s. (The average yearly gain or decline is determined by computing the cumulative 
change over a period of years and dividing by the number of years.) 
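The calculation in the parenthetical above amounts to one line of arithmetic. The sketch below applies it to the rounded grade 10 math percentages shown for Nevada in table 3; the function name is hypothetical.

```python
def average_yearly_gain(first_value, last_value, first_year, last_year):
    """Cumulative change divided by the number of years elapsed."""
    return (last_value - first_value) / (last_year - first_year)


# Nevada grade 10 math, rounded percentages proficient from table 3.
gain = average_yearly_gain(52, 47, 2004, 2006)
print(gain)  # average yearly change in percentage points
```

With the rounded figures the result is -2.5 percentage points per year, whereas table 3 reports -2.8. The gap is consistent with that table's note that averages are subject to rounding error, presumably because the published averages were computed from unrounded percentages.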

For each of the grade spans with two or more years of comparable data (the minimum period
needed to compute average yearly gains), we calculated the median of states’ average yearly
gains since 2002. (The median is a sort of midpoint; an equal number of states fall above and
below the median.) The results for reading are displayed in table 7 and the results for math in
table 8. In reading, the median of states’ average yearly gains in percentage proficient since
2002 was 1.8 percentage points per year at the elementary level, and 1.0 percentage points at
both the middle and high school levels. In math, the median of states’ average yearly gains was
notably higher: 3.0 percentage points per year at the elementary level, 2.1 at the middle
school level, and 1.8 at the high school level. Above and below the median, the average yearly
gains in individual states covered a wide spectrum. In elementary reading, the average yearly
gains in percentages proficient ranged from a minimum of -2.2 percentage points (in other
words, a decline) in one state to a maximum of +10.0 percentage points in another. In high
school reading the range of average yearly gains was even broader, from a minimum of -2.9
percentage points (a decline) in one state to a maximum of +18.0 percentage points in another.

Table 7. Statistics on Average Yearly Gains in Percentages Proficient in
Reading Since 2002

Statistic                           Elementary   Middle   High

Median                              1.8          1.0      1.0
Minimum                             -2.2         -2.0     -2.9
Maximum                             10.0         8.0      18.0
Number of states with valid data    49           49       46

Table reads: Of the 49 states with at least two years of comparable data, the median average yearly gain in the percentage
proficient in reading was 1.8 percentage points per year at the elementary level. Among individual states, the average yearly
gains in percentage proficient in elementary reading ranged from a minimum of -2.2 percentage points per year (a decline)
to a maximum of +10.0 percentage points per year.

Table 8. Statistics on Average Yearly Gains in Percentages Proficient in
Math Since 2002

Statistic                           Elementary   Middle   High

Median                              3.0          2.1      1.8
Minimum                             -0.9         -1.0     -4.0
Maximum                             11.0         11.0     7.7
Number of states with valid data    49           49       47

Table reads: Of the 49 states with at least two years of comparable data, the median average yearly gain in the percentage
proficient in math was 3.0 percentage points per year at the elementary level. Among individual states, the average yearly
gains in percentage proficient in elementary math ranged from a minimum of -0.9 percentage points per year (a decline) to
a maximum of +11.0 percentage points per year.







TEST SCORE TRENDS IN READING AND MATH SINCE 2002 



In addition to analyzing broad trends across grades and subjects, we also took a closer look 
at separate achievement trends in reading and in math since 2002. 



Reading Trends 

In reading, performance has increased since 2002. As already noted, seven states showed 
gains in reading across all three grade spans in both percentages proficient and effect sizes 
(table 1). Based on percentages proficient alone, 9 states demonstrated moderate-to-large 
increases across the three grade spans (table 5). 

Within the same grade span, more states demonstrated increases in reading than declines. 
For example, based on both percentage proficient and effect size data (table 2), 14 states 
showed moderate-to-large gains at the elementary level, while just one state showed a 
decline. Based on percentages proficient alone, 29 states experienced moderate-to-large gains 
in elementary reading, while only 2 states experienced moderate-to-large declines (table 6). 

Fewer states made improvements in reading at the high school level than at other grade levels. 
Based on both percentage proficient and effect size data (table 2), two states showed 
moderate-to-large declines at high school. Based solely on percentages proficient (table 6), 
four states experienced high school declines. 



Mathematics Trends 

In math, performance has also increased since 2002. As table 5 illustrates, 19 states showed 
moderate-to-large gains in percentages proficient across all grade spans in math; among 
them were the 9 states with gains across all grade spans in effect sizes, as well (table 1). 

Within the same grade span, more states experienced gains in math than declines. Of the 
states with both percentage proficient and effect size data (table 2), 22 showed moderate-to- 
large gains in math at the elementary school level, while none had declines. As with reading, 
improvements at the high school level were less striking than at other grade spans; only 
12 states showed moderate-to-large gains at the high school level. 

Results in math were most impressive at the elementary level. Based on proficiency data 
alone (table 6), 37 of the 41 states with sufficient data showed at least moderate gains in 
elementary math. 

POSSIBLE EXPLANATIONS FOR TRENDS SINCE 2002 

Our evidence shows that gains in state test scores have far outweighed declines since NCLB 
took effect. Below we offer some possible explanations for the increases. The list is not exhaus- 
tive, but these are the explanations most often mentioned in research on test trends. Any or 
all of these factors in combination may be contributing to these trends. Moreover, different 
explanations could apply to different states or school districts within states. 

• Increased learning. One likely reason for the upward trends in state test scores is that 
students are learning more and consequently are doing better on state tests. 
Administrators and teachers have made major efforts to improve achievement, according 
to CEP’s case studies and nationally representative survey of school districts. According 
to this year’s district survey, the following four strategies were most often considered 
successful in raising achievement in Title I schools identified for improvement under
NCLB: hiring additional teachers to reduce class size (cited as at least somewhat successful
by 97% of districts with such schools), providing assistance through school support 
teams (95%), increasing the quality and quantity of teacher and principal professional 
development (92%), and aligning curriculum and instruction with standards and/or 
assessments (91%) (CEP, 2007a).³ As noted in chapter 2, however, it is not possible to
sort out how much of the impetus for these types of changes has come from NCLB and 
how much from state and local reforms. In CEP’s surveys, roughly 7 of 10 district 
respondents cited school district programs and policies unrelated to NCLB as important 
causes of improved achievement in reading and math, and more than a third cited state 
programs and policies (CEP, 2007a). 

• Teaching to the test. Teaching to the test can be a positive practice when it means aligning 
curricula to well-designed standards and tests and ensuring that classroom teaching covers 
the most important knowledge and skills contained in those standards (CEP, 2002). 
Teaching to the test can have adverse effects, however, if it means narrowing the curricu- 
lum to include only the subjects, topics, and skills that are likely to appear on state tests. 
This latter practice can raise test scores without also improving students’ mastery of the 
broader subject being tested. It can give the false impression that student achievement is 
rising when students are actually learning the same amount or less; this is sometimes 
referred to as “score inflation” (Koretz, 2005). CEP’s past district surveys and case studies 
found evidence that many school districts are reducing time in other subjects to allow 
more time for reading and math (CEP, 2006).⁴

• More lenient tests, scoring, or data analyses. We were careful not to compare test data 
when we were aware of breaks in comparability due to major changes in testing systems. 
But as explained in chapter 3, test results can still be subtly manipulated through a series 
of small decisions that affect such factors as equating, scoring, and proficiency levels and 
that amount to tinkering rather than substantial alterations. Faced with intense pressure 
to show achievement gains, state education officials may be likely to err on the side of 
leniency when making these types of decisions. It is difficult to find evidence of these 
types of subtle changes to state testing programs; however, we do know that some of the 
changes that states have made to their NCLB accountability plans have increased the 
numbers of students counted as proficient (CEP, 2007c). 

• Changes in populations tested. Changes in the student population tested from year to year 
can affect aggregate state test scores. To cite one example, if significantly more students are 
held back in grade, it could appear that achievement in a particular grade has increased 
from one year to the next; for instance, the students who are retained in 4th grade may do
better on the 4th grade tests after repeating a grade, while the cohort of students in 5th grade
will not include the lowest-achieving students who had not been promoted. To cite a con- 
trasting example, if one year’s cohort of test-takers includes a significantly higher propor- 
tion from historically low-performing subgroups, such as limited-English-proficient 
students, than the previous year’s cohort did, achievement may appear to decrease in the 
aggregate, but the apparent decrease is a consequence of demography rather than learning. 
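The second example under "changes in populations tested" can be made concrete with a small calculation. The enrollment counts and percentages below are invented for illustration. Each subgroup performs exactly the same in both years, yet the aggregate percentage proficient falls because the lower-performing subgroup makes up a larger share of test-takers in the second year.

```python
def aggregate_percent_proficient(groups):
    """Weighted average of subgroup percentages proficient.

    groups: list of (number_tested, percent_proficient) pairs.
    """
    total_tested = sum(n for n, _ in groups)
    return sum(n * pct for n, pct in groups) / total_tested


# Hypothetical state: (students tested, percent proficient) for two subgroups.
year1 = [(900, 60.0), (100, 30.0)]
year2 = [(800, 60.0), (200, 30.0)]  # same subgroup performance, different mix

print(aggregate_percent_proficient(year1))
print(aggregate_percent_proficient(year2))
```

Here the aggregate drops from 57% to 54% with no change in either subgroup's results, which is the kind of apparent decline the text attributes to demography rather than learning.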

Phase II of this study, which involves actual visits to a subset of states and interviews with
state officials, will explore alternative explanations for test score trends in participating states.



³ More detailed information about strategies for raising achievement in schools identified for improvement is included in a new
report from CEP, scheduled for release in June 2007.

⁴ Additional information about changes in instructional time, curriculum emphasis, and test preparation related to NCLB is
included in a second new report from CEP, also scheduled for release in June 2007.



Pre- and Post-NCLB Trends 



Knowing whether test scores have improved since NCLB took effect is only part of the 
national picture. Since many states began implementing reform efforts well before NCLB 
was enacted, it is also important to determine whether the pace of improvement has sped 
up or slowed down since 2002. To make this determination, we compared average yearly 
gains in achievement before and after the law’s enactment in 2002, using both percentages
proficient and effect sizes, where available. In our analysis, the pre-NCLB period ended at 
2002, and the post-NCLB period started at 2002. 

Only 13 states met the criteria necessary to make pre- and post-NCLB comparisons — 
at least two years of data before and after 2002. 



STATES WITH GREATER POST-NCLB GAINS 

Of the 13 states with sufficient pre- and post-NCLB data, 9 made greater average yearly 
gains in achievement after NCLB was enacted than before, by most indicators. They include 
Kansas, Kentucky, Louisiana, New Hampshire, New Jersey, New York, Pennsylvania, 
Washington, and Wyoming. 

Table 9 compares average yearly gains in achievement before and after NCLB in each of 
these nine states. Pre- and post-NCLB comparisons are made for every grade span that had 
sufficient trend data, using percentages proficient and effect sizes where available. In the 
table, each measure (percentage proficient or effect size) and each grade and subject (such as 
grade 4 reading or grade 4 math) is counted as one point of comparison. In the far right col- 
umn is a statement summarizing how many points of comparison showed greater post- 
NCLB gains than pre-NCLB gains. For example, Kansas had 12 points of 
comparison — two measures (percentages proficient and effect size) times six grades (three 
grades in reading and three in math). New York had four points of comparison — one meas- 
ure (percentages proficient) times four grades (two in reading and two in math). In Kansas, 
post-NCLB gains exceeded pre-NCLB gains on all 12 points of comparison; in New York, 
post-NCLB gains exceeded pre-NCLB gains on three of the four points of comparison. 
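The tally of points of comparison can be sketched as follows, using Washington's pre- and post-NCLB values as listed in table 9. The data structure and function name are hypothetical, and the sketch assumes that a comparison counts only when the post-NCLB gain is strictly greater than the pre-NCLB gain, so that ties do not count.

```python
# Washington, 1999-2006: (pre-NCLB, post-NCLB) average yearly gains from table 9.
# "pct" is the percentage point gain; "es" is the effect size gain.
washington = {
    "Reading 4":  {"pct": (2.2, 3.8), "es": (0.05, 0.13)},
    "Reading 7":  {"pct": (1.3, 4.1), "es": (0.03, 0.22)},
    "Reading 10": {"pct": (2.6, 5.6), "es": (0.06, 0.11)},
    "Math 4":     {"pct": (4.8, 1.8), "es": (0.14, 0.08)},
    "Math 7":     {"pct": (2.1, 4.5), "es": (0.06, 0.17)},
    "Math 10":    {"pct": (1.4, 3.4), "es": (0.05, 0.05)},
}


def count_greater_post(state):
    """Return (comparisons where post > pre, total points of comparison)."""
    comparisons = [pair for grade in state.values() for pair in grade.values()]
    greater = sum(post > pre for pre, post in comparisons)
    return greater, len(comparisons)


greater, total = count_greater_post(washington)
print(f"Post-NCLB gains exceed pre-NCLB gains on {greater} of {total} comparisons")
```

Under the strict-inequality assumption the count is 9 of 12, which agrees with the summary shown for Washington in table 9; the grade 10 math effect size (0.05 before and after) is a tie and is not counted.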




Table 9. Average Yearly Gains by Subject and Grade for States with
Greater Overall Gains After NCLB Than Before

State, Years of Data,       Average Yearly              Average Yearly
Subject, Grade Level        Percentage Point Gain       Effect Size Gain
                            Pre-NCLB    Post-NCLB       Pre-NCLB    Post-NCLB

Kansas, 2000-2005
  Reading 5                 -0.1        4.9             0.00        0.14
  Reading 8                 -0.5        3.5             -0.01       0.10
  Reading 11                -0.9        2.9             -0.03       0.08
  Math 4                    2.6         5.7             0.08        0.20
  Math 7                    1.1         4.1             0.03        0.11
  Math 10                   0.7         2.6             0.02        0.06
  Comparison of pre-NCLB v. post-NCLB gains: Post-NCLB gains exceed pre-NCLB gains on 12 of 12 comparisons

Kentucky, 1999-2006
  Reading 4                 1.3         2.7             0.03        0.06
  Reading 7                 1.7         2.0             0.03        0.05
  Reading 10                1.7         3.3             0.04        0.07
  Math 5                    2.7         5.3             0.08        0.11
  Math 8                    1.0         2.0             0.05        0.06
  Math 11                   1.7         2.0             0.06        0.04
  Comparison of pre-NCLB v. post-NCLB gains: Post-NCLB gains exceed pre-NCLB gains on 11 of 12 comparisons

Louisiana, 1999-2006
  Reading 4                 0.7         1.8             0.04        0.02
  Reading 8                 1.7         1.8             0.06        0.03
  Math 4                    2.7         3.0             0.08        0.08
  Math 8                    1.0         3.0             0.04        0.05
  Comparison of pre-NCLB v. post-NCLB gains: Post-NCLB gains exceed pre-NCLB gains on 5 of 8 comparisons

New Hampshire, years vary as shown (no effect size data available)
  Reading 3, 2000-2004      1.5         1.0
  Reading 6, 2000-2004      -0.5        6.0
  Reading 10, 2000-2006     1.5         3.0
  Math 3, 2000-2004         -0.5        5.5
  Math 6, 2000-2004         0.5         2.5
  Math 10, 2000-2006        -0.5        3.8
  Comparison of pre-NCLB v. post-NCLB gains: Post-NCLB gains exceed pre-NCLB gains on 5 of 6 comparisons

New Jersey, 1999-2006
  Reading 8                 -1.4        0.3             -0.02       -0.03
  Math 4                    2.7         3.5             0.08        0.10
  Math 8                    -1.0        1.6             -0.04       0.03
  Comparison of pre-NCLB v. post-NCLB gains: Post-NCLB gains exceed pre-NCLB gains on 5 of 6 comparisons

New York, 1999-2005 (no effect size data available)
  Reading 4                 4.3         3.0
  Reading 8                 -1.3        1.3
  Math 4                    0.0         6.0
  Math 8                    2.3         3.5
  Comparison of pre-NCLB v. post-NCLB gains: Post-NCLB gains exceed pre-NCLB gains on 3 of 4 comparisons

Pennsylvania¹
  Reading 5                 0.9         0.9             0.02        -0.01
  Reading 8                 -1.3        3.0             0.00        0.12
  Reading 11                0.9         1.5             0.03        0.04
  Math 5                    0.1         3.5             0.03        0.11
  Math 8                    0.7         2.6             0.03        0.06
  Math 11                   1.7         0.6             0.03        0.02
  Comparison of pre-NCLB v. post-NCLB gains: Post-NCLB gains exceed pre-NCLB gains on 8 of 12 comparisons

Washington, 1999-2006
  Reading 4                 2.2         3.8             0.05        0.13
  Reading 7                 1.3         4.1             0.03        0.22
  Reading 10                2.6         5.6             0.06        0.11
  Math 4                    4.8         1.8             0.14        0.08
  Math 7                    2.1         4.5             0.06        0.17
  Math 10                   1.4         3.4             0.05        0.05
  Comparison of pre-NCLB v. post-NCLB gains: Post-NCLB gains exceed pre-NCLB gains on 9 of 12 comparisons




Wyoming 

1999-2005 


Reading 4 


0.0 


1.0 






Reading 8 


-0.7 


0.3 






Reading 11 


-.03 


1-7 


No effect size 


Post-NCLB gains exceed pre-NCLB 


Math 4 


-0.7 


2.0 


data available 


gains on 5 of 6 comparisons 


Math 8 


1.0 


1-7 






Math 11 


2-7 


2-7 






Table reads: Kansas, which had six years of comparable percentage proficient and effect size data (2000-2005), had a 
pre-NCLB average yearly decline in grade 5 reading of 0.1 percentage points and a post-NCLB average yearly gain of 
4.9 percentage points. In terms of effect size, Kansas had a pre-NCLB average yearly gain in grade 5 reading of 0.00 
standard deviation units and a post-NCLB average yearly gain of 0.14 standard deviation units (14% of a standard 
deviation). Post-NCLB gains exceeded pre-NCLB gains on all 12 points of comparison in Kansas. 



Note: Italics signify that effect sizes showed a different trend than percentages proficient. 



1 Pennsylvania had percentages proficient for 2001-2006 and effect sizes for 1999-2006. 




STATES WITH GREATER PRE-NCLB GAINS 

Of the 13 states with pre- and post-NCLB achievement data, 4 states, while not showing 
overall declines, experienced slower rates of increase by most indicators after NCLB took 
effect. These states are Delaware, Massachusetts, Oregon, and Virginia. 

Table 10 is similar to table 9, except that it compares pre- and post-NCLB average yearly 
gains in states that had greater gains before 2002 than after. 






